Foreword



The papers contained in this volume were presented at the 12th Annual Sym- 
posium on Combinatorial Pattern Matching, held July 1-4, 2001 at the Dan 
Panorama Hotel in Jerusalem, Israel. They were selected from 35 abstracts sub- 
mitted in response to the call for papers. In addition, there were invited lectures 
by Aviezri Fraenkel (Weizmann Institute of Science), Zvi Galil (Columbia), Rao
Kosaraju (Johns Hopkins University), and Uzi Vishkin (Technion and U. Maryland).
This year the call for papers invited short (poster) presentations. They
also appear in the proceedings. 

Combinatorial Pattern Matching (CPM) addresses issues of searching and 
matching strings and more complicated patterns such as trees, regular expres- 
sions, graphs, point sets, and arrays, in various formats. The goal is to derive non- 
trivial combinatorial properties of such structures and to exploit these properties 
in order to achieve superior performance for the corresponding computational 
problems. On the other hand, an important aim is to analyze and pinpoint the 
properties and conditions under which searches cannot be performed efficiently.

Over the past decade a steady flow of high quality research on this subject has 
changed a sparse set of isolated results into a full-fledged area of algorithmics. 
This area is continuing to grow even further due to the increasing demand for 
speed and efficiency that stems from important applications such as the World 
Wide Web, computational biology, computer vision, and multimedia systems. 
These involve requirements for information retrieval in heterogeneous databases, 
data compression, and pattern recognition. The objective of the annual CPM 
gathering is to provide an international forum for the presentation of research 
results in combinatorial pattern matching and related applications. 

The first 11 meetings were held in Paris, London, Tucson, Padova, Asilomar, 
Helsinki, Laguna Beach, Aarhus, Piscataway, Warwick, and Montreal, over the 
years 1990-2000. After the first meeting, a selection of papers appeared as a 
special issue of Theoretical Computer Science in volume 92. The proceedings of 
the 3rd to 11th meetings appeared as volumes 644, 684, 807, 937, 1075, 1264, 
1448, 1645, and 1848 of the Springer LNCS series. Selected papers of the 12th 
meeting will appear in a special issue of Discrete Applied Mathematics. 

The general organization and orientation of the CPM conferences is coor- 
dinated by a steering committee composed of Alberto Apostolico (Padova and 
Purdue), Maxime Crochemore (Marne-la-Vallée), Zvi Galil (Columbia) and Udi
Manber (Yahoo.

Regular Expression Searching over Ziv-Lempel
Compressed Text



Gonzalo Navarro* 

Dept. of Computer Science, University of Chile
Blanco Encalada 2120, Santiago, Chile
gnavarro@dcc.uchile.cl



Abstract. We present a solution to the problem of regular expression 
searching on compressed text. The format we choose is the Ziv-Lempel 
family, specifically the LZ78 and LZW variants. Given a text of length u 
compressed into length n, and a pattern of length m, we report all the R 
occurrences of the pattern in the text in $O(2^m + mn + Rm \log m)$ worst
case time. On average this drops to $O(m^2 + (n + R) \log m)$ or
$O(m^2 + n + Ru/n)$ for most regular expressions. This is the first nontrivial result
for this problem. The experimental results show that our compressed 
search algorithm needs half the time necessary for decompression plus 
searching, which is currently the only alternative. 



1 Introduction 

The need to search for regular expressions arises in many text-based applications, 
such as text retrieval, text editing and computational biology, to name a few. 
A regular expression is a generalized pattern composed of (i) basic strings and
(ii) the union, concatenation and Kleene closure of other regular expressions.
The problem of regular expression searching is quite old and has received
continuous attention from the sixties to the present day (see Section 2.1).

A particularly interesting case of text searching arises when the text is
compressed. Text compression exploits the redundancies of the text to
represent it using less space. There are many different compression schemes,
among which the Ziv-Lempel family is one of the best in practice because of
its good compression ratios combined with efficient compression and
decompression times. The compressed matching problem consists of searching
for a pattern in a compressed text without uncompressing it. Its main goal is
to search the compressed text faster than the trivial approach of
decompressing it and then searching. This problem is important in practice.
Today's textual databases are an excellent example of applications where both
problems are crucial: the texts should be kept compressed to save space and
I/O time, and they should be efficiently searched. Surprisingly, these two
requirements are not easy to achieve together, as the only solution before
the 1990s was to process queries by uncompressing the texts and then
searching them.

* Partially supported by Fondecyt grant 1-990627.

Since then, a lot of research has been conducted on the problem. A wealth of
solutions has been proposed (see Section 2.2) to deal with simple, multiple
and, very recently, approximate compressed pattern matching. Regular
expression searching on compressed text seems to be the last goal that still
lacks any nontrivial solution.

This is the problem we solve in this paper: we present the first solution
for compressed regular expression searching. The format we choose is the
Ziv-Lempel family, focusing on the LZ78 and LZW variants. Given a text
of length u compressed into length n, we are able to find the R occurrences of
a regular expression of length m in $O(2^m + mn + Rm \log m)$ worst case time,
needing $O(2^m + mn)$ space. We also propose two modifications which achieve
$O(m^2 + (n + R) \log m)$ or $O(m^2 + n + Ru/n)$ average case time and,
respectively, $O(m + n \log m)$ or $O(m + n)$ space, for "admissible" regular
expressions, i.e. those whose automaton runs out of active states after
reading $O(1)$ text characters. These results are achieved using
bit-parallelism and are valid for short enough patterns; otherwise the search
times have to be multiplied by $\lceil m/w \rceil$, where w is the number of
bits in the computer word.

We have implemented our algorithm on LZW and compared it against the 
best existing algorithms on uncompressed text, showing that we can search the 
compressed text twice as fast as the naive approach of uncompressing and then 
searching. 



2 Related Work 

2.1 Regular Expression Searching 

The traditional technique to search for a regular expression of length m
(which means m letters, not counting special operators such as "|", "*", etc.)
in a text of length u is to convert the expression into a nondeterministic
finite automaton (NFA) with $O(m)$ nodes. Then, it is possible to search the
text using the automaton in $O(mu)$ worst case time. The cost comes from the
fact that more than one state of the NFA may be active at each step, and
therefore all of them may need to be updated.

On top of the basic algorithm for converting a regular expression into an
NFA, we have to add a self-loop at the initial state, which guarantees that it
always remains active, so it is able to detect a match starting anywhere in
the text. At each text position where a final state becomes active we signal
the end point of an occurrence.
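The NFA search described above can be sketched as follows. This is a minimal illustration, not the paper's code: the function name `nfa_search`, the dictionary-based transition table `DELTA`, and the hand-built automaton for the example expression ab*c are all assumptions made for the example.

```python
def nfa_search(text, delta, initial, finals):
    """Return the 1-based text positions where an occurrence ends.

    delta maps (state, char) to the set of states reached by that char.
    """
    active = {initial}
    ends = []
    for pos, ch in enumerate(text, start=1):
        # The self-loop keeps the initial state active at every step.
        nxt = {initial}
        for state in active:                 # up to O(m) active states
            nxt |= delta.get((state, ch), set())
        active = nxt
        if active & finals:                  # a final state became active
            ends.append(pos)
    return ends


# Hand-built automaton for ab*c (hypothetical example, states 0..3).
DELTA = {
    (0, "a"): {1},
    (1, "b"): {2}, (1, "c"): {3},
    (2, "b"): {2}, (2, "c"): {3},
}
print(nfa_search("xabbcac", DELTA, 0, {3}))  # [5, 7]
```

The inner loop over the active states is what produces the $O(mu)$ bound: each of the u text characters may require updating up to m + 1 states.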

A more efficient choice is to convert the NFA into a deterministic finite
automaton (DFA), which has only one active state at a time and therefore allows
searching the text at $O(u)$ cost, which is worst-case optimal. The cost of this
approach is that the DFA may have $O(2^m)$ states, which implies a preprocessing
cost and extra space exponential in m.

An easy way to obtain a DFA from an NFA is via bit-parallelism, which is
a technique to code many elements in the bits of a single computer word and
manage to update all of them in a single operation. In this case, the vector of
active and inactive states is stored as the bits of a computer word. Instead of
examining the active states one by one (à la Thompson), the whole computer
word is used to index a table which, given the current text character, provides
the new set of active states (another computer word). This can be considered
either as a bit-parallel simulation of an NFA, or as an implementation of a DFA
(where the identifier of each deterministic state is the bit mask as a whole).
This idea was first proposed by Wu and Manber.
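A small sketch of this table-driven simulation is given below, under stated assumptions: the names `subset_table`, `bitparallel_search` and `STEP` are hypothetical, Python integers stand in for computer words, and the example automaton is a hand-built one for ab*c with states 0..3 (bit i of a mask represents state i).

```python
def subset_table(step_c, nstates):
    """table[mask] = union of step_c[i] over the states i set in mask."""
    table = [0] * (1 << nstates)
    for mask in range(1, 1 << nstates):
        low = mask & -mask                       # lowest set bit of mask
        table[mask] = table[mask ^ low] | step_c[low.bit_length() - 1]
    return table


def bitparallel_search(text, step, nstates, final_mask):
    """step[c][i] is the bit mask of states reached from state i by c."""
    tables = {c: subset_table(row, nstates) for c, row in step.items()}
    empty = [0] * (1 << nstates)                 # letters not in the pattern
    state = 1                                    # only state 0 is active
    ends = []
    for pos, ch in enumerate(text, start=1):
        # One table lookup replaces the loop over the active states;
        # "| 1" is the self-loop on the initial state.
        state = tables.get(ch, empty)[state] | 1
        if state & final_mask:
            ends.append(pos)
    return ends


STEP = {
    "a": [0b0010, 0, 0, 0],
    "b": [0, 0b0100, 0b0100, 0],
    "c": [0, 0b1000, 0b1000, 0],
}
print(bitparallel_search("xabbcac", STEP, 4, 0b1000))  # [5, 7]
```

Each table has one entry per subset of states, which is where the exponential preprocessing cost and space come from.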

Later, Navarro and Raffinot used a similar procedure, this time using
Glushkov's construction of the NFA. This construction has the advantage of
producing an automaton of exactly m + 1 states, while Thompson's may reach
2m states. A drawback is that the structure is not so regular and therefore a
table $D : 2^{m+1} \times (\sigma + 1) \to 2^{m+1}$ is required, where $\sigma$
is the size of the pattern alphabet $\Sigma$. Thompson's construction, on the
other hand, is more regular and only needs a table $D : 2^{2m} \to 2^{2m}$ for
the $\varepsilon$-transitions. It has been shown that Glushkov's construction
normally yields faster search time. In any case, if the table is too big it can
be split horizontally in two or more tables. For example, a table of size $2^m$
can be split into 2 subtables of size $2^{m/2}$. We need to access two tables
for a transition but need only the square root of the space.
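The horizontal split works because the transition from a set of states is the union of the transitions from any partition of that set. The sketch below illustrates this with hypothetical names (`subset_table`, `split_tables`, `lookup`) and an arbitrary 4-state target vector chosen only for the example.

```python
def subset_table(targets):
    """table[mask] = OR of targets[i] over the bits i set in mask."""
    table = [0] * (1 << len(targets))
    for mask in range(1, len(table)):
        low = mask & -mask
        table[mask] = table[mask ^ low] | targets[low.bit_length() - 1]
    return table


def split_tables(targets):
    """Build two half-size tables, one per half of the state bits."""
    half = len(targets) // 2
    return subset_table(targets[:half]), subset_table(targets[half:]), half


def lookup(split, mask):
    """Two table accesses, combined with one OR."""
    low_table, high_table, half = split
    low_bits = mask & ((1 << half) - 1)
    return low_table[low_bits] | high_table[mask >> half]


TARGETS = [0b0010, 0b0100, 0b1100, 0b0001]   # arbitrary example transitions
full = subset_table(TARGETS)                  # 16 entries
split = split_tables(TARGETS)                 # 4 + 4 entries
print(all(lookup(split, s) == full[s] for s in range(16)))  # True
```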

Some techniques have been proposed to obtain a tradeoff between NFAs and 
DFAs. In 1992, Myers presented a Four-Russians approach which obtains
$O(mu / \log u)$ worst-case time and extra space. The idea is to divide the syntax
tree of the regular expression into “modules” , which are subtrees of a reasonable 
size. These subtrees are implemented as DFAs and are thereafter considered as 
leaf nodes in the syntax tree. The process continues with this reduced tree until 
a single final module is obtained. 

The ideas presented up to now aim at a good implementation of the automa- 
ton, but they must inspect all the text characters. Other proposals try to skip 
some text characters, as is usual for simple pattern matching. For example,
Watson [chapter 5] presented an algorithm that determines the minimum
length of a string matching the regular expression and forms a tree with all the
prefixes of that length of strings matching the regular expression. A
multipattern search algorithm like Commentz-Walter is run over those prefixes
as a filter to detect text areas where a complete occurrence may start. Another
technique of this kind is used in GNU Grep 2.0, which extracts a set of strings
which must appear in any match. This set is searched for and the neighborhoods
of its occurrences are checked for complete matches using a lazy deterministic
automaton.

The most recent development, also in this line, is from Navarro and Raffinot.
They invert the arrows of the DFA and make all states initial and the initial
state final. The result is an automaton that recognizes all the reverse prefixes of
strings matching the regular expression. The idea is in this sense similar to that
of Watson, but takes less space. The search method is also different: instead of
a Boyer-Moore like algorithm, it is based on BNDM.

2.2 Compressed Pattern Matching

The compressed matching problem was first defined in the work of Amir and
Benson as the task of performing string matching in a compressed text
without decompressing it. Given a text T, a corresponding compressed string
$Z = z_1 \ldots z_n$, and a pattern P, the compressed matching problem consists
in finding all occurrences of P in T, using only P and Z. A naive algorithm,
which first decompresses the string Z and then performs standard string
matching, takes time $O(m + u)$. An optimal algorithm takes worst-case time
$O(m + n + R)$, where R is the number of matches (note that it could be that
R = u > n).

Two different approaches exist to search compressed text. The first one is
rather practical. Efficient solutions based on Huffman coding on words have
been presented by Moura et al., but they require the text to contain natural
language and to be large (say, 10 Mb or more). Moreover, they allow only
searching for whole words and phrases. There are also other practical ad-hoc
methods, but the compression they obtain is poor. Moreover, in these
compression formats $n = \Theta(u)$, so the speedups can only be measured in
practical terms.

The second line of research considers Ziv-Lempel compression, which is based
on finding repetitions in the text and replacing them with references to
similar strings that appeared previously. LZ77 is able to reference any
substring of the text already processed, while LZ78 and LZW reference only a
single previous reference plus a new letter that is added.

String matching in Ziv-Lempel compressed texts is much more complex, since 
the pattern can appear in different forms across the compressed text. The first 
algorithm for exact searching is from 1994, by Amir, Benson and Farach, who
search in LZ78 needing time and space $O(m^2 + n)$.

The only search technique for LZ77 is by Farach and Thorup, a randomized
algorithm to determine in time $O(m + n \log^2(u/n))$ whether a pattern is
present or not in the text.

An extension of the first work to multipattern searching was presented by
Kida et al., together with the first experimental results in this area. They
achieve $O(m^2 + n)$ time and space, although this time m is the total length of
all the patterns.

New practical results were presented by Navarro and Raffinot, who
proposed a general scheme to search on Ziv-Lempel compressed texts (simple and
extended patterns) and specialized it for the particular cases of LZ77, LZ78 and
a newly proposed variant which was competitive and convenient for search
purposes. A similar result, restricted to the LZW format, was independently
found and presented by Kida et al. The same group generalized the existing
algorithms and nicely unified the concepts in a general framework. Recently,
Navarro and Tarhio presented a new, faster algorithm based on Boyer-Moore.

Approximate string matching on compressed text aims at finding the pattern
where a limited number of differences between the pattern and its occurrences
are permitted. The problem, advocated in 1992, had been solved for Huffman
coding of words, but the solution is limited to searching for a whole word and
retrieving whole words that are similar. The first true solutions appeared very
recently, by Kärkkäinen et al., Matsumoto et al. and Navarro et al.



3 The Ziv-Lempel Compression Formats LZ78 and LZW

The general idea of Ziv-Lempel compression is to replace substrings in the text 
by a pointer to a previous occurrence of them. If the pointer takes less space 
than the string it is replacing, compression is obtained. Different variants
of this type of compression exist. We are particularly interested
in the LZ78/LZW format, which we describe in depth.

The Ziv-Lempel compression algorithm of 1978 (usually named LZ78)
is based on a dictionary of blocks, to which we add every new block computed.
At the beginning of the compression, the dictionary contains a single block
$b_0$ of length 0. The current step of the compression is as follows: if we
assume that a prefix $T_{1 \ldots j}$ of T has already been compressed into a
sequence of blocks $Z = b_1 \ldots b_r$, all of them in the dictionary, then we
look for the longest prefix of the rest of the text $T_{j+1 \ldots u}$ which is
a block of the dictionary. Once we have found this block, say $b_s$ of length
$\ell_s$, we construct a new block $b_{r+1} = (s, T_{j+\ell_s+1})$, we write
the pair at the end of the compressed file Z, i.e. $Z = b_1 \ldots b_r b_{r+1}$,
and we add the block to the dictionary. It is easy to see that this dictionary
is prefix-closed (i.e. any prefix of an element is also an element of the
dictionary) and a natural way to represent it is a tree.

We give as an example the compression of the word ananas in Figure 1. The
first block is (0,a), and next (0,n). When we read the next a, a is already
block 1 in the dictionary, but an is not in the dictionary. So we create a third
block (1,n). We then read the next a; a is already block 1 in the dictionary,
but as does not appear. So we create a new block (1,s).
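The compression step can be sketched in a few lines. This is an illustrative sketch only: the function name `lz78_compress` is hypothetical, the dictionary tree is stored as a hash map from (parent block, letter) to block number, and the handling of a text that ends in the middle of a known block (emitting a pair with an empty letter) is an assumption, since the description above does not cover that case.

```python
def lz78_compress(text):
    """Return the LZ78 parse of text as a list of (block, letter) pairs."""
    children = {}            # dictionary tree: (parent block, letter) -> block
    blocks = []
    current = 0              # block 0 is the empty string
    for ch in text:
        nxt = children.get((current, ch))
        if nxt is not None:
            current = nxt    # keep extending the longest known prefix
        else:
            blocks.append((current, ch))
            children[(current, ch)] = len(blocks)   # number of the new block
            current = 0
    if current != 0:
        # Assumption: flush a trailing known block with an empty letter.
        blocks.append((current, ""))
    return blocks


print(lz78_compress("ananas"))  # [(0, 'a'), (0, 'n'), (1, 'n'), (1, 's')]
```

Each text character causes exactly one lookup in the tree, which matches the linear-time behavior of the compressor.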



[Figure 1: for each encoded prefix (a, an, anan, ananas), the dictionary tree
and the compressed file so far: (0,a); (0,a)(0,n); (0,a)(0,n)(1,n);
(0,a)(0,n)(1,n)(1,s).]

Fig. 1. Compression of the word ananas with the algorithm LZ78.



The compression algorithm is $O(u)$ time in the worst case and efficient in
practice if the dictionary is stored as a tree, which allows rapid searching of
the new text prefix (for each character of T we move once in the tree). The
decompression needs to build the same dictionary (the pair that defines the
block r is read at the r-th step of the algorithm), although this time it is not
convenient to have a tree, and an array implementation is preferable. Compared
to LZ77, the compression is rather fast but decompression is slow.
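A minimal sketch of the array-based decompression follows. The name `lz78_decompress` and the two parallel arrays `ref` and `letter` are assumptions for illustration; each block is expanded by walking its referencing chain back to block 0.

```python
def lz78_decompress(blocks):
    """Rebuild the text from a list of (referenced block, letter) pairs."""
    ref = [0]                # ref[r]: block referenced by block r
    letter = [""]            # letter[r]: explicit letter of block r
    out = []
    for r, ch in blocks:
        ref.append(r)
        letter.append(ch)
        # Collect the letters along the chain, last letter first.
        chunk = []
        b = len(ref) - 1
        while b != 0:
            chunk.append(letter[b])
            b = ref[b]
        out.append("".join(reversed(chunk)))
    return "".join(out)


print(lz78_decompress([(0, "a"), (0, "n"), (1, "n"), (1, "s")]))  # ananas
```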

Many variations on LZ78 exist, which deal basically with the best way to 
code the pairs in the compressed file, or with the best way to cope with limited 
memory for compression. A particularly interesting variant is from Welch, called
LZW. In this case, the extra letter (second element of the pair) is not coded,
but it is taken as the first letter of the next block (the dictionary is started with
one block per letter). LZW is used by Unix's compress program.

In this paper we do not consider LZW separately but just as a coding variant
of LZ78. This is because the final letter of LZ78 can be readily obtained by
keeping track of the first letter of each block (this is copied directly from
the referenced block) and then looking at the first letter of the next block.



4 A Search Algorithm 



We now present our approach for regular expression searching over a text
$Z = b_1 \ldots b_n$ that is expressed as a sequence of n blocks. Each block
$b_r$ represents a substring $B_r$ of T, such that $B_1 \ldots B_n = T$.
Moreover, each block $B_r$ is formed by a concatenation of a previously seen
block and an explicit letter. This comprises the LZ78 and LZW formats. Our goal
is to find the positions in T where the pattern occurrences end, using Z.

Our approach is to modify the DFA algorithm based on bit-parallelism, which
is designed to process T character by character, so that it processes T block by
block using the fact that blocks are built from previous blocks and explicit
letters. We assume that Glushkov's construction is used, so the NFA has m + 1
states. We start by building the DFA in $O(2^m)$ time and space.

Our bit masks will denote sets of NFA states, so they will be of width m + 1.
For clarity we will write the sets of states, keeping in mind that we can compute
$A \cup B$, $A \cap B$, $A^c$, $A = B$, $A \leftarrow B$, $a \in A$ in constant
time (or, for long patterns, in $O(\lceil m/w \rceil)$ time, where w is the
number of bits in the computer word). Another operation we will need to perform
in constant time is to select any element of a set. This can be achieved with
"bit magic", which means precomputing a table storing the position of, say, the
highest bit for each possible bit mask of length m + 1, which is not much given
that we already store $\sigma$ such tables.

About our automaton, we assume that the states are numbered 0 . . . m, 0 being
the initial state. We call F the bit mask of final states, and the transition
function is $D : \text{bitmasks} \times \Sigma \to \text{bitmasks}$.

The general mechanism of the search is as follows: we read the blocks $b_r$
one by one. For each new block b read, representing a string B, and where we
have already processed $T_{1 \ldots j}$, we update the state of the search so
that after working on the block we have processed
$T_{1 \ldots j+|B|} = T_{1 \ldots j} B$. To process each block, three steps are
carried out: (1) its description is computed and stored, (2) the occurrences
ending inside the block B are reported, and (3) the state of the search is
updated.

Say that block b represents the text substring B. Then the description of b
is formed by

— a number $len(b) = |B|$, its length;

— a block number $ref(b)$, the referenced block;

— a vector $tr_{0 \ldots m}$ of bit masks, where $tr_i$ gives the states of the
NFA that remain active after reading B if only the i-th state of the NFA is
active at the beginning;

— a bit mask $act = \{i, tr_i \neq \emptyset\}$, which indicates which states
of the NFA may yield any surviving state after processing B;

— a bit mask fin, which indicates which states, if active before processing B,
produce an occurrence inside B (after processing at least one character of
B); and

— a vector $mat_{0 \ldots m}$ of block numbers, where $mat_i$ gives the most
recent (i.e. longest) block b' in the referencing chain b, ref(b),
ref(ref(b)), ... such that $i \in fin(b')$, or a null value if there is no
such block.

The state of the search consists of two elements:

— the last text position considered, j (initially 0);

— a bit mask S of m + 1 bits, which indicates which states are active after
processing $T_{1 \ldots j}$. Initially, S has active only its initial state,
S = {0}.

As we show next, the total cost to search for all the occurrences with this
scheme is $O(2^m + mn + Rm \log m)$ in the worst case. The first term
corresponds to building the DFA from the NFA, the second to computing block
descriptions and updating the search state, and the last to reporting the
occurrences. The existence problem is solved in time $O(2^m + mn)$. The space
requirement is $O(2^m + mn)$. We recall that patterns longer than the computer
word w get their search cost multiplied by $\lceil m/w \rceil$.

4.1 Computing Block Descriptions 

We show how to compute the description of a new block b' that represents
B' = Ba, where B is the string represented by a previous block b and a is an
explicit letter. An initial block $b_0$ represents the string $\varepsilon$, and
its description is: $len(b_0) = 0$; $tr_i(b_0) = \{i\}$;
$act(b_0) = \{0 \ldots m\}$; $fin(b_0) = \emptyset$; $mat_i(b_0)$ = a null
value. We now give the update formulas for B' = Ba.

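— $len(b') \leftarrow len(b) + 1$.

— $ref(b') \leftarrow b$.

— $tr_i(b') \leftarrow D(tr_i(b), a)$ (we only need to do this for $i \in act(b)$).

— $act(b') \leftarrow \{i \in act(b), tr_i(b') \neq \emptyset\}$.

— $fin(b') \leftarrow fin(b) \cup \{i \in act(b'), tr_i(b') \cap F \neq \emptyset\}$.

— $mat_i(b') \leftarrow mat_i(b)$ if $tr_i(b') \cap F = \emptyset$, and b' otherwise.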

In the worst case we have to update all the cells of tr and mat, so we pay
$O(mn)$ time (recall that bit parallelism permits performing set operations in
constant time). The space required for the block descriptions is $O(mn)$ as well.
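The update formulas can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the names `Block`, `initial_block`, `extend`, `make_D` and `STEP` are hypothetical, bit i of a mask stands for state i, and `make_D` computes the transition function D on the fly from a small hand-built automaton for ab*c instead of looking it up in a precomputed DFA table.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Block:
    length: int
    ref: int
    tr: List[int]                 # tr[i]: mask of states reached from state i
    act: int                      # mask of states i with tr[i] nonempty
    fin: int                      # mask of states producing a match inside
    mat: List[Optional[int]]      # mat[i]: latest chain block matching at its end


def initial_block(nstates):
    """Description of block 0, which represents the empty string."""
    return Block(0, 0, [1 << i for i in range(nstates)],
                 (1 << nstates) - 1, 0, [None] * nstates)


def extend(blocks, ref, letter, D, final_mask):
    """Append the description of the block (ref, letter); return its number."""
    old, new_id = blocks[ref], len(blocks)
    tr, act, fin, mat = [0] * len(old.tr), 0, old.fin, list(old.mat)
    for i, mask in enumerate(old.tr):
        if mask == 0:             # i is not in act(b): nothing to update
            continue
        tr[i] = D(mask, letter)
        if tr[i]:
            act |= 1 << i
            if tr[i] & final_mask:
                fin |= 1 << i
                mat[i] = new_id
    blocks.append(Block(old.length + 1, ref, tr, act, fin, mat))
    return new_id


def make_D(step):
    """Set-to-set transition function; state 0 carries the self-loop."""
    def D(mask, letter):
        row = step.get(letter)
        out = mask & 1            # the self-loop keeps state 0 active
        i = 0
        while mask:
            if mask & 1 and row:
                out |= row[i]
            mask >>= 1
            i += 1
        return out
    return D


STEP = {"a": [0b0010, 0, 0, 0],
        "b": [0, 0b0100, 0b0100, 0],
        "c": [0, 0b1000, 0b1000, 0]}
blocks = [initial_block(4)]
for ref, letter in [(0, "a"), (0, "b"), (0, "c"), (1, "b")]:
    extend(blocks, ref, letter, make_D(STEP), 0b1000)
print(blocks[3].mat)  # [None, 3, 3, None]
```

Block 3 represents the single letter c, so states 1 and 2 (which have a transition by c into the final state) are the ones recorded in its fin mask and mat vector.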
4.2 Reporting Matches and Updating the Search State

The $fin(b')$ mask tells us whether there are any occurrences to report depending
on the active states at the beginning of the block. Therefore, our first action is
to compute $S \cap fin(b')$, which tells us which of the currently active states
will produce occurrences inside B'. If this mask turns out to be null, we can
skip the process of reporting matches.

If there are states in the intersection then we will have matches to report
inside B'. Now, each state i in the intersection produces a list of positions
which can be retrieved in decreasing order using $mat_i(b')$,
$mat_i(ref(mat_i(b')))$, .... If B' starts at text position j, then we have to
report the text positions $j + len(mat_i(b')) - 1$,
$j + len(mat_i(ref(mat_i(b')))) - 1$, .... These positions appear in decreasing
order, but we have to merge the decreasing lists of all the states in
$S \cap fin(b')$. A priority queue can be used to obtain each position in
$O(\log m)$ time. If there are R occurrences overall, then in the worst case
each occurrence can be reported m times (reached from each state), which gives
a total cost of $O(Rm \log m)$.

Finally, we update S in $O(m)$ time per block with
$S \leftarrow \bigcup_{i \in S \cap act(b')} tr_i(b')$.
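Putting the three steps together, the whole search over an LZ78 parse can be sketched as below. This is a hedged illustration rather than the paper's code: `lz78_regex_search`, `make_D` and `STEP` are hypothetical names, the transition function D is computed on the fly instead of being read from a DFA table, the example automaton is hand-built for ab*c, and positions are 1-based end points of occurrences.

```python
import heapq


def make_D(step):
    """Set-to-set transition function; state 0 carries the self-loop."""
    def D(mask, letter):
        row, out, i = step.get(letter), mask & 1, 0
        while mask:
            if mask & 1 and row:
                out |= row[i]
            mask >>= 1
            i += 1
        return out
    return D


def lz78_regex_search(blocks, nstates, D, final_mask):
    """Return the increasing text positions where occurrences end."""
    length, ref = [0], [0]
    tr = [[1 << i for i in range(nstates)]]        # block 0: empty string
    act, fin, mat = [(1 << nstates) - 1], [0], [[None] * nstates]
    S, done, ends = 1, 0, []                       # done = letters processed
    for r, letter in blocks:
        b = len(length)
        # (1) Description of the new block.
        t = [D(x, letter) if x else 0 for x in tr[r]]
        a = sum(1 << i for i in range(nstates) if t[i])
        hit = [bool(t[i] & final_mask) for i in range(nstates)]
        f = fin[r] | sum(1 << i for i in range(nstates) if hit[i])
        mt = [b if hit[i] else mat[r][i] for i in range(nstates)]
        length.append(length[r] + 1)
        ref.append(r)
        tr.append(t)
        act.append(a)
        fin.append(f)
        mat.append(mt)
        # (2) Merge the decreasing per-state lists with a priority queue.
        heap = [(-length[mt[i]], i, mt[i])
                for i in range(nstates) if (S & f) >> i & 1]
        heapq.heapify(heap)
        found = []
        while heap:
            neg_len, i, blk = heapq.heappop(heap)
            pos = done - neg_len                   # done + len(blk)
            if not found or found[-1] != pos:      # same end from two states
                found.append(pos)
            nxt = mat[ref[blk]][i]
            if nxt is not None:
                heapq.heappush(heap, (-length[nxt], i, nxt))
        ends.extend(reversed(found))
        # (3) Update the search state.
        new_S = 0
        for i in range(nstates):
            if (S & a) >> i & 1:
                new_S |= t[i]
        S, done = new_S, done + length[b]
    return ends


STEP = {"a": [0b0010, 0, 0, 0],
        "b": [0, 0b0100, 0b0100, 0],
        "c": [0, 0b1000, 0b1000, 0]}
# LZ78 parse of "abcabbcxac": a | b | c | ab | bc | x | ac
PARSE = [(0, "a"), (0, "b"), (0, "c"), (1, "b"), (2, "c"), (0, "x"), (1, "c")]
print(lz78_regex_search(PARSE, 4, make_D(STEP), 0b1000))  # [3, 7, 10]
```

The occurrences abc, abbc and ac end at positions 3, 7 and 10; the second one spans two blocks and is found through the state carried over in S.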



5 A Faster Algorithm on Average 

An average case analysis of our algorithm reveals that, except for mat, all the 
other operations can be carried out in linear time. This leads to a variation of 
the algorithm that is linear time on average. 

The main point is that, on average, $|act(b)| = |tr_i(b)| = O(1)$, that is, the
number of states of the automaton which can survive after processing a block is
constant. We prove in the Appendix that this holds under very general
assumptions and for "admissible" regular expressions (i.e. those whose automata
run out of active states after processing $O(1)$ text characters). Note that,
thanks to the self-loop in the initial state 0, this state is always in
$act(b)$ and in $tr_0(b)$.

Constant Time Operations. Except for mat, all the computation of the block
description is proportional to the size of act and hence it takes $O(n)$ time
overall (see Section 4.1): $tr_i(b')$ needs to be computed only for those
$i \in act(b)$; and $act(b')$ and $fin(b')$ can also be computed in time
proportional to $|act(b)|$ or $|act(b')|$. The update to S (see Section 4.2)
needs only to consider the states in $act(b')$. Each active bit in act is
obtained in constant time by bit magic.

Updating the mat Vector. What we need is a mechanism to update mat fast.
Note that, although $mat_i(b')$ is null if $i \notin fin(b')$, it may not be
true that $|fin(b')| = O(1)$ on average, because as soon as a state belongs to
$fin(b)$, it belongs to fin of all the descendants of b in the LZ78 tree.

However, it is still true that just $O(1)$ values of $mat(b)$ change in
$mat(b')$, where $ref(b') = b$, since mat changes only on those
$\{i, tr_i(b') \cap F \neq \emptyset\} \subseteq act(b')$, and
$|act(b')| = O(1)$.

Hence, we do not represent a new mat vector for each block, but only its
differences with respect to the referenced block. This must be done such that
(i) the mat vector of the referenced block is not altered, as it may have to be
used for other descendants; and (ii) we are able to quickly find $mat_i$ for
any i.

A solution is to represent mat as a complete tree (i.e. perfectly balanced),
which will always have m + 1 nodes and associates the keys {0 . . . m} to their
value $mat_i$. This permits obtaining the value $mat_i$ in $O(\log m)$ time. We
start with a complete tree, and later need only to modify the values associated
to tree keys, but never add or remove keys (otherwise an AVL tree would have
been a good choice). When a new value has to be associated to a key in the tree
of the referenced block in order to obtain the tree of the referencing block, we
find the key in the old tree and create a copy of the path from the root to the
key. Then we change the value associated to the new node holding the key. Except
where the new nodes are involved, the created path points to the same nodes that
the old path points to, hence sharing part of the tree. The new root corresponds
to the modified tree of the new block. The cost of each such modification is
$O(\log m)$. We have to perform this operation $O(1)$ times on average per
block, yielding $O(n \log m)$ time.
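The path-copying idea can be sketched with immutable nodes. The names `Node`, `build`, `lookup` and `update` are assumptions for illustration; the tree is a balanced binary search tree over the keys 0..m, and an update returns a new root while leaving the old tree untouched.

```python
from typing import Any, NamedTuple, Optional


class Node(NamedTuple):
    key: int
    value: Any
    left: Optional["Node"]
    right: Optional["Node"]


def build(lo, hi):
    """Balanced tree over the keys lo..hi, all values initially null."""
    if lo > hi:
        return None
    mid = (lo + hi) // 2
    return Node(mid, None, build(lo, mid - 1), build(mid + 1, hi))


def lookup(node, key):
    """Value stored under key (the key is assumed to be present)."""
    while node.key != key:
        node = node.left if key < node.key else node.right
    return node.value


def update(node, key, value):
    """Return a new root; only the path from the root to key is copied."""
    if key == node.key:
        return node._replace(value=value)
    if key < node.key:
        return node._replace(left=update(node.left, key, value))
    return node._replace(right=update(node.right, key, value))


old_root = build(0, 6)                      # mat of the referenced block
new_root = update(old_root, 5, 7)           # mat_5 becomes block 7
print(lookup(old_root, 5), lookup(new_root, 5))   # None 7
print(new_root.left is old_root.left)       # True: untouched subtree is shared
```

Each update allocates one node per level of the tree, which is the $O(\log m)$ space and time per modification mentioned above.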

Figure 2 illustrates the idea. This kind of technique is usual when
implementing the logical structure of WORM (write once read many) devices, in
order to reflect the modifications of the user on a medium that does not permit
alterations.




Fig. 2. Changing node 5 to 5’ in a read-only tree. 



Reporting Matches. We now have to add the cost to report the R matches.
Since $|tr_i(b)| = O(1)$ on average, there are only $O(1)$ states able to
trigger an occurrence at the end of a block, and hence each occurrence is
triggered by $O(1)$ states on average. The priority queue gives us those
positions in $O(\log m)$ time per position, so the total cost to trigger
occurrences is on average $O(R \log m)$.

Lowering Space and Preprocessing Costs. The fact that $|tr_i(b)| = O(1)$ on
average shows another possible improvement. We have chosen a DFA representation
of our automaton which needs $O(2^m)$ space and preprocessing time. Instead, an
NFA representation would require $O(m^2)$. The problem with the NFA is that,
in order to build $tr_i(b')$ for $b' = (b, a)$, we need to make the union of the
NFA states reachable via the letter a from each state in $tr_i(b)$. This has a
worst case of $O(m)$, yielding $O(m^2)$ worst case search time to update a
block. However, on average this drops to $O(1)$ since only $O(1)$ states i have
$tr_i(b) \neq \emptyset$ (because $|act(b)| = O(1)$) and each such $tr_i(b)$ has
constant size.

Therefore, we have obtained average complexity $O(m^2 + (n + R) \log m)$. The
space requirements are lowered as well. The NFA requires only $O(m)$ space.
The block descriptions take $O(n)$ space because there are only $O(1)$ nonempty
$tr_i$ masks. With respect to the mat trees, we have that there are on average
$O(1)$ modifications per block and each creates $O(\log m)$ new nodes, so the
space required for mat is on average $O(n \log m)$. Hence the total space is
$O(m + n \log m)$.

If R is really small we may prefer an alternative implementation. Instead of
representing mat, we store for each block a bit mask ffin, which tells whether
there is a match exactly at the end of the block. While fin is active we go
backward in the referencing chain of the block reporting all those blocks whose
ffin mask is active in a state of S. This yields $O(m^2 + n + Ru/n)$ time on
average instead of $O(m^2 + (n + R) \log m)$. The space becomes $O(m + n)$.
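The ffin alternative can be sketched as a plain walk along the referencing chain. The function name `report_with_ffin` and the hand-built arrays in the example are hypothetical; the masks are assumed to satisfy fin(b) = fin(ref(b)) ∪ ffin(b), and `start` is the text position where the block begins.

```python
def report_with_ffin(S, b, start, ref, length, fin, ffin):
    """Decreasing end positions of the occurrences inside block b.

    fin[x] and ffin[x] are bit masks of the states that yield a match
    somewhere inside block x and exactly at the end of block x.
    """
    positions = []
    while fin[b] & S:                     # some active state still matches
        if ffin[b] & S:                   # a match ends exactly at this block
            positions.append(start + length[b] - 1)
        b = ref[b]                        # step back in the referencing chain
    return positions


# Hand-built chain 3 -> 2 -> 1 -> 0 with lengths 3, 2, 1, 0.
REF = [0, 0, 1, 2]
LENGTH = [0, 1, 2, 3]
FFIN = [0b000, 0b001, 0b100, 0b101]
FIN = [0b000, 0b001, 0b101, 0b101]
print(report_with_ffin(0b101, 3, 10, REF, LENGTH, FIN, FFIN))  # [12, 11, 10]
print(report_with_ffin(0b100, 3, 10, REF, LENGTH, FIN, FFIN))  # [12, 11]
```

The walk costs time proportional to the number of chain blocks visited rather than $O(\log m)$ per occurrence, which is the trade reflected in the Ru/n term.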

6 Experimental Results 

We have implemented our algorithm in order to determine its practical value. 
We chose to use the LZW format by modifying the code of Unix’s uncompress, 
so our code is able to search files compressed with compress (.Z). This implies 
some small changes in the design, but the algorithm is essentially the same. 
We have used bit parallelism, with a single table (no horizontal partitioning) 
and map (at search time) the character set to an integer range representing the 
different pattern characters, to reduce space. Finally, we have chosen to use the 
ffin masks instead of representing mat. 

We ran our experiments on a 550 MHz Intel Pentium III machine with 64
Mb of RAM. We have compressed 10 Mb of Wall Street Journal articles, which 
gets compressed to 42% of its original size with compress. We measure user time, 
as system times are negligible. Each data point has been obtained by repeating 
the experiment 10 times. 

In the absence of other algorithms for compressed regular expression search- 
ing, we have compared our algorithm against the naive approach of decompress- 
ing and searching. The WSJ file needed 3.58 seconds to be decompressed with 
uncompress. After decompression, we run two different search algorithms. The
first one, DFA, uses a bit-parallel DFA to process the text. This is interesting
because it is the algorithm we are modifying to work on compressed text. The
second one, the software nrgrep, uses a character skipping technique for
searching which is much faster. In any case, the time to uncompress is an order
of magnitude higher than that to search the uncompressed text, so the search
algorithm used does not significantly affect the results.

A major problem when presenting experiments on regular expressions is that
there is no concept of a "random" regular expression, so it is not possible to
search for, say, 1,000 random patterns. Lacking such a choice, we fixed a set
of 7 patterns which were selected to illustrate different interesting cases. The
patterns are given in Table 1 together with some parameters and the obtained
search times. We use the normal operators to denote regular expressions plus
some extensions, such as "[a-z]" = (a|b|c|...|z) and "." = all the characters.
Note that the 7th pattern is not "admissible" and the search time is affected.



Table 1. The patterns used on Wall Street Journal articles and the search times 
in seconds. 



No.  Pattern                      m   R      Ours   Uncompress   Uncompress
                                                    + Nrgrep     + DFA
1    American|Canadian            17  1801   1.81   3.75         3.85
2    Amer[a-z]*can                9   1500   1.79   3.67         3.74
3    Amer[a-z]*can|Can[a-z]*ian   16  1801   2.23   3.73         3.87
4    Ame(i|(r|i)*)can             10  1500   1.62   3.70         3.72
5    Am[a-z]*ri[a-z]*an           9   1504   1.88   3.68         3.72
6    (Am|Ca)(er|na)(ic|di)an      15  1801   1.70   3.70         3.75
7    Am.*er.*ic.*an               12  92945  2.74   3.68         3.74


As the table shows, we can actually improve over the decompression of the 
text followed by the application of any search algorithm (indeed, just the decom- 
pression takes much more time). In practical terms, we can search the original 
file at about 4-5 Mb/sec. This is about half the time necessary for decompression 
plus searching with the best algorithm. 
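
The comparison in Table 1 can be restated as throughput and speedup figures. The following Python sketch only re-derives ratios from the times reported in the table and from the 10 Mb text size; the names `TIMES`, `speedup` and `throughput` are ours and purely illustrative.

```python
# Search times in seconds copied from Table 1:
# pattern number -> (ours, uncompress + nrgrep, uncompress + DFA).
TIMES = {
    1: (1.81, 3.75, 3.85),
    2: (1.79, 3.67, 3.74),
    3: (2.23, 3.73, 3.87),
    4: (1.62, 3.70, 3.72),
    5: (1.88, 3.68, 3.72),
    6: (1.70, 3.70, 3.75),
    7: (2.74, 3.68, 3.74),
}
TEXT_MB = 10  # size of the uncompressed text, as stated in the experiments


def speedup(ours, other):
    """How many times faster the compressed search is than the alternative."""
    return other / ours


def throughput(ours):
    """Mb of original text searched per second."""
    return TEXT_MB / ours


for no, (ours, nrgrep, dfa) in TIMES.items():
    print(f"pattern {no}: {throughput(ours):.1f} Mb/sec, "
          f"{speedup(ours, nrgrep):.2f}x vs nrgrep, "
          f"{speedup(ours, dfa):.2f}x vs DFA")
```

The non-admissible 7th pattern shows the smallest gain, which matches the remark made about it before the table.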

We have used compress because it is the format we are dealing with. In some
scenarios, LZW is the preferred format because it maximizes compression (e.g. it
compresses DNA better than LZ77). However, we may prefer a decompress plus
search approach under the LZ77 format, which decompresses faster. For example,
GNU gzip needs 2.07 seconds for decompression on our machine. If we compare
our search algorithm on LZW against decompressing on LZ77 plus searching, we
are still 20% faster.

7 Conclusions

We have presented the first solution to the open problem of regular expression 
searching over Ziv-Lempel compressed text. Our algorithm can find the $R$ oc-
currences of a regular expression of length $m$ over a text of size $u$ compressed by
LZ78 or LZW into size $n$ in $O(2^m + mn + Rm\log m)$ worst-case time and, for
most regular expressions, $O(m^2 + (n + R)\log m)$ or $O(m^2 + n + Ru/n)$ average
case time. We have shown that this is also of practical interest, as we are able 
to search on compressed text twice as fast as decompressing plus searching. 

An interesting question is whether we can improve the search time using
character skipping techniques. The first would have to be combined with
multipattern search techniques on LZ78/LZW. For the second type of search
(BNDM) there is no existing algorithm on compressed text yet. We are also
working on extending these ideas to other compression formats, e.g. a Ziv-
Lempel variant where the new block is the concatenation of the previous and
the current one. The existence problem seems to require $O(m^2 n)$ time for
this format.

Appendix: Average Number of Active Bits

The goal of this Appendix is to show that, on average, $|act(b)| = |tri(b)| = O(1)$.
In this section $\sigma$ denotes the size of the text alphabet.

Let us consider the process of generating the LZ78/LZW tree. A string from 
the text is read and the current tree is followed, until the new string read “falls 
out” of the tree. At that point we add a new node to the tree and restart reading 
the text. It is clear that, at least for Bernoulli sources, the resulting tree is the 
same as the result of inserting n random strings of infinite length. 

Let us now consider initializing our NFA with just state i active. Now, we 
backtrack on the LZ78 tree, entering into all possible branches and feeding the 
automaton with the corresponding letter. We stop when the automaton runs out 
of active states. 

The total amount of tree nodes touched in this process is exactly the amount
of text blocks $b$ whose $i$-th bit in $act(b)$ is active, i.e. the blocks such that if we
start with state i active, we finish the block with some active state. Hence the 
total amount of states in act over all the blocks of the text corresponds to the 
sum of tree nodes touched when starting the NFA initialized with each possible 
state i. 

As shown by Baeza-Yates and Gonnet, the cost of backtracking on a
tree of $n$ nodes with a regular expression is $O(\mathrm{polylog}(n)\, n^{\lambda})$, where $0 < \lambda < 1$
depends on the structure of the regular expression. This result applies only to 
random tries over a uniformly distributed alphabet and for an arbitrary regular 
expression which has no outgoing edges from final states. We remark that the 
letter probabilities on the LZ78 tree are more uniform than on the text, so even 
on biased text the uniform model is not such a bad approximation. In any case the
result can probably be extended to biased cases. 

Despite being suggestive, the previous result cannot be immediately applied 
to our case. First, it is not meaningful to consider such a random text in a 
compression scenario, since in this case compression would be impossible. Even 
a scenario where the text follows a biased Bernoulli or Markov model can be 
restrictive. Second, our DFAs may well have outgoing transitions from the
final states (this matters for the previous result because, as soon as a final state is
reached, the whole subtrie is reported). On the other hand, we cannot afford
an arbitrary text and pattern simultaneously because it will always be possible 
to design a text tailored to the pattern that yields a low efficiency. Hence, we 
consider the most general scenario which is reasonable to face: 

Definition 1. Our arbitrariness assumption states that text and pattern are 
arbitrary but independent, in the sense that there is zero correlation between text 
substrings and strings generated by the regular expression. 

The arbitrariness assumption permits us to extend our analysis to any text
and pattern, under the condition that the text cannot be especially designed for
the pattern. Our second step is to set a reasonable condition over the pattern.
The number of strings of length $\ell$ that are accepted by an automaton is

$$N(\ell) = \sum_j \pi_j \omega_j^{\ell} = O(c^{\ell})$$

where the sum is finitary and $\pi_j$ and $\omega_j$ are constants. The result is simple
to obtain with generating functions: for each state $i$ the function $f_i(z)$ counts
the number of strings of each length that can be generated from state $i$ of the
DFA, so if edges labeled $a_1 \dots a_k$ reach states $i_1 \dots i_k$ from $i$ we have $f_i(z) =
z\,(f_{i_1}(z) + \dots + f_{i_k}(z) + 1 \cdot [i \text{ final}])$, which leads to a system of equations formed
by polynomials and possibly fractions of the form $1/(1 - z)$. The solution to the
system is a rational function, i.e. a quotient between polynomials $P(z)/Q(z)$,
which corresponds to a sequence of the form $\sum_j \pi_j \omega_j^{\ell}$. We are now ready to
establish our condition over the admissible regular expressions.

Definition 2. A regular expression is admissible if the number of strings of
length $\ell$ that it generates is at most $c^{\ell}$, where $c < \sigma$, for any $\ell = \omega(1)$.

Unadmissible regular expressions are those which basically match all the
strings of every length, e.g. a(a|b)*a over the alphabet {a,b}, which matches
$2^{\ell}/4 = \Theta(2^{\ell})$ strings of length $\ell$. However, there are other cases. For example,
pattern matching allowing $k$ errors can be modeled as a regular expression which
matches every string for $\ell = O(k)$. As we see shortly, we can handle some
unadmissible regular expressions anyway.
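
As an illustration of the counting argument, the following Python sketch counts the strings of each length accepted by a DFA by dynamic programming over its states. The transition table `DELTA` is a hand-built DFA for a(a|b)*a and the name `count_accepted` is hypothetical; this is a sketch under those assumptions, not the procedure used in the paper.

```python
def count_accepted(delta, start, finals, max_len):
    """Return [N(0), ..., N(max_len)], the number of accepted strings per length.

    delta maps (state, char) -> state; missing entries lead to an implicit
    dead state, so they simply contribute nothing.
    """
    counts = []
    paths = {start: 1}  # number of strings of the current length reaching each state
    for _ in range(max_len + 1):
        counts.append(sum(c for s, c in paths.items() if s in finals))
        nxt = {}
        for (s, _ch), t in delta.items():
            if s in paths:
                nxt[t] = nxt.get(t, 0) + paths[s]
        paths = nxt
    return counts


# DFA for a(a|b)*a over {a, b}:
#   0 = start, 1 = "begins with a, not accepting", 2 = "begins and ends with a".
DELTA = {(0, "a"): 1, (1, "a"): 2, (1, "b"): 1, (2, "a"): 2, (2, "b"): 1}
counts = count_accepted(DELTA, 0, {2}, 8)
print(counts)
```

For lengths of at least 2 the counts equal a quarter of all strings over the binary alphabet, so no constant smaller than the alphabet size bounds the growth rate, which is why this expression is unadmissible.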

If a regular expression is admissible and the arbitrariness assumption holds,
then if we feed it with characters from a random text position the automaton
runs out of active states after $O(1)$ iterations. The reason is that the automaton
recognizes at most $c^{\ell}$ strings of length $\ell$, out of the $\sigma^{\ell}$ possibilities. Since text and pattern
are uncorrelated, the probability that the automaton recognizes the selected
text substring after $\ell$ iterations is $O((c/\sigma)^{\ell}) = O(\alpha^{\ell})$, where we have defined
$\alpha = c/\sigma < 1$. Hence the expected amount of steps until the automaton runs out
of active states is $\sum_{\ell \ge 0} \alpha^{\ell} = 1/(1 - \alpha) = O(1)$.

Let us consider a perfectly balanced tree of $n$ nodes obtained from the text,
of height $h = \log_\sigma n$. If we start an automaton at the root of the trie, it will
touch $O(c^{\ell})$ nodes at the tree level $\ell$. This means that the total number of nodes
traversed is

$$O(c^{h}) = O\left(n^{\log_\sigma c}\right) = O(n^{\lambda})$$

for $\lambda < 1$. So in this particular case we repeat the result that exists for random
tries, which is not surprising. Let us now consider the LZ78 tree of an arbitrary
text, which has $f(\ell)$ nodes at depth $\ell$, where

$$\sum_{\ell=0}^{h} f(\ell) = n \quad\text{and}\quad f(0) = 1, \quad f(\ell - 1) \le f(\ell) \le \sigma^{\ell}.$$

By the arbitrariness assumption, those $f(\ell)$ strings cannot have correlation with
the pattern, so the traversal of the tree touches $\alpha^{\ell} f(\ell)$ of those nodes at level $\ell$.
Therefore the total number of nodes traversed is

$$C = \sum_{\ell=0}^{h} \alpha^{\ell} f(\ell)$$


Let us now start with an arbitrary tree and try to modify it in order to
increase the number of traversed nodes while keeping the same total number
of nodes $n$. Let us move a node from level $i$ to level $j$. The new cost is $C' =
C - \alpha^{i} + \alpha^{j}$. Clearly we increase the cost by moving nodes upward. This means
that the worst possible tree is the perfectly balanced one, where all nodes are
as close to the root as possible. On the other hand, LZ78 tries obtained from
texts tend to be quite balanced, so the worst and average case are quite close
anyway. As an example of the other extreme, consider an LZ78 tree with maximum
unbalancing (e.g. for the text $a^u$). In this case the total number of nodes traversed
is $O(1)$.
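
The two extremes can be compared numerically. The Python sketch below evaluates the sum of the per-level costs for a perfectly balanced tree and for a degenerate path with the same number of nodes; the values chosen for the alphabet size, for c and for the height are arbitrary illustrative assumptions, and the function names are ours.

```python
def traversal_cost(level_sizes, alpha):
    """Sum over levels l of alpha**l * f(l), where level_sizes[l] = f(l)."""
    return sum(alpha ** l * f for l, f in enumerate(level_sizes))


def balanced_levels(sigma, height):
    """Level sizes of a perfectly balanced sigma-ary tree: f(l) = sigma**l."""
    return [sigma ** l for l in range(height + 1)]


def path_levels(n):
    """Level sizes of a maximally unbalanced tree: one node per level."""
    return [1] * n


sigma, c, height = 4, 2, 4           # illustrative values only
alpha = c / sigma                    # alpha = c / sigma < 1
balanced = balanced_levels(sigma, height)
n = sum(balanced)                    # both trees have the same number of nodes

cost_balanced = traversal_cost(balanced, alpha)   # equals the sum of c**l
cost_path = traversal_cost(path_levels(n), alpha) # bounded by 1 / (1 - alpha)
print(n, cost_balanced, cost_path)
```

With these values the balanced tree costs the sum of the powers of c up to the height, while the path never exceeds the constant 1/(1 - alpha), in line with the argument above.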

So we have that, under the arbitrariness assumption, the total number of tree
nodes traversed by an admissible regular expression is $O(n^{\lambda})$ for some $\lambda < 1$.
We now use this result for our analysis.

It is clear that if we take our NFA and make state i the initial state, the 
result corresponds to a regular expression because any NFA can be converted 
into a regular expression. So the total amount of states in act is

$$O\left(n^{\lambda_0} + n^{\lambda_1} + \dots + n^{\lambda_m}\right)$$

where $\lambda_i$ corresponds to taking $i$ as the initial state. We say that a state is admis-
sible if, when that state is considered as the initial state, the regular expression
becomes admissible.

Note that, given the self-loop we added at state 0, we have $\lambda_0 = 1$, i.e. state
0 is unadmissible. However, all the other states must be admissible because
otherwise the original regular expression would not be admissible. That is, there
is a fixed probability $p$ of reaching the unadmissible state and from there the
automaton recognizes all the strings, which gives at least $p\sigma^{\ell} = \Theta(\sigma^{\ell})$ strings
recognized.

Hence, calling

$$\lambda = \max(\lambda_1, \dots, \lambda_m) < 1$$

we have that the total number of active states in all the act bit masks is

$$O\left(n + m n^{\lambda}\right) = O(n)$$

where we made the last simplification considering that $m = O(\mathrm{polylog}(n))$,
which is weaker than usual assumptions and true in practice. Therefore, we
have proved that, under mild restrictions (much more general than the usual
randomness assumption), the amortized number of active states in the act masks
is $O(1)$.

Note that we can even afford unadmissible states, as long as they are reachable only
from $O(1)$ other states, and the result still holds. For example, if our regular
expression is $a(a|b)^*a^m$ we have only $O(1)$ initial states that yield unadmissible
expressions, and our result holds. On the other hand, if we have $a^m(a|b)^*a$ then
the unadmissible state can be reached from $O(m)$ other states and our result
does not hold.

We focus now on the size of the $tri(b)$ sets for admissible regular expressions.
Let us consider the text substring $B$ corresponding to a block $b$.

We first consider the initial state, which is always active. How many states
can get activated from the initial state? At each step, the initial state may ac-
tivate $O(\sigma)$ admissible states, but given the arbitrariness assumption, the prob-
ability of each such state being active $\ell$ steps later is $O(\alpha^{\ell})$. While processing
$B_{1 \dots k}$, the initial state is always active, so at the end of the processing we have
$\sum_{i=0}^{k} O(\alpha^{i}) = O(1)$ active states (the term $\alpha^{i}$ corresponds to the point where we
were processing $B_{k-i}$).

We consider now the other $m$ admissible states, whose activation vanishes
after examining $O(1)$ text positions. In their case the probability of yielding
an active state after processing $B$ is $O(\alpha^{|B|})$. Hence they totalize $O(m\alpha^{|B|})$ active
states. As before, the worst tree is the most balanced one, in which case there are
$\sigma^{\ell}$ blocks of length $\ell$, for $\ell$ from 0 to $h = \log_\sigma n$. The total number of active states is

$$\sum_{\ell=0}^{h} \sigma^{\ell} m \alpha^{\ell} = O(m c^{h}) = O(m n^{\lambda})$$




Hence, we have in total $O(n + m n^{\lambda}) = O(n)$ active bits in the tri sets, where
the $n$ comes from the $O(1)$ states activated from the initial state and the $m n^{\lambda}$
from the other states.

Abstract. We explore the possibility of using multiple processors to
improve the encoding and decoding tasks of Lempel Ziv schemes. A new 
layout of the processors is suggested and it is shown how LZSS and 
LZW can be adapted to take advantage of such parallel architectures. 
Experimental results show an improvement in compression and time over 
standard methods. 



1 Introduction 

Compression methods are often partitioned into static and dynamic methods. 
The static methods assume that the file to be compressed has been generated 
according to a certain model which is fixed in advance and known to both com- 
pressor and decompressor. The model could be based on the probability distri- 
bution of the different characters or more generally of certain variable length 
substrings that appear in the file, combined with a procedure to parse the file 
into a well determined sequence of such elements. The encoded file can then 
be obtained by applying some statistical encoding function, such as Huffman or 
arithmetic coding. Information about the model is either assumed to be known 
(such as the distribution of characters in English text), or may be gathered in a 
first pass over the file, so that the compression process may only be performed 
in a second pass. 

Many popular compression methods, however, are adaptive in nature. The 
underlying model is not assumed to be known, but discovered during the se- 
quential processing of the file. The encoding and decoding of the i-th element 
is based on the distribution of the i — 1 preceding ones, so that compressor 
and decompressor can work in synchronization without requiring the transmit- 
tal of the model itself. Examples of adaptive methods are the Lempel-Ziv (LZ) 
methods and their variants, but there are also adaptive versions of Huffman and 
arithmetic coding. 

We wish to explore the possibility of using multiple processors to improve the
encoding and decoding tasks. This has been done previously for static Huffman coding,
focusing in particular on the decoding process. The current work investigates how
parallel processing could be made profitable for Lempel Ziv coding.

Previous work on parallelizing compression includes studies which deal with
LZ compression and others relating to Huffman and arithmetic coding. A par-
allel method for the construction of Huffman trees has also been proposed. Our work
concentrates on LZ methods, in particular a variant of LZ77 known as LZSS,
and a variant of LZ78 known as LZW. In LZSS, the encoded file consists
of a sequence of items each of which is either a single character, or a pointer of 
the form (off, len) which replaces a string of length len that appeared off char- 
acters earlier in the file. Decoding of such a file is thus a very simple procedure, 
but for the encoding there is a need to locate longest reoccurring strings, for 
which sophisticated data structures like hash tables or binary trees have been 
suggested. In LZW, the encoded file consists of a sequence of pointers to a
dictionary, each pointer replacing a string of the input file that appeared ear-
lier and has been put into the dictionary. Encoder and decoder must therefore
construct identical copies of the dictionary. 
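
To make the remark about the simplicity of LZSS decoding concrete, here is a minimal Python sketch of a sequential decoder. The item representation (a one-character string for a literal, a tuple `(off, len)` for a pointer) and the name `lzss_decode` are assumptions made for illustration; real LZSS implementations pack these items into bits.

```python
def lzss_decode(items):
    """Decode a list of LZSS items into text.

    Each item is either a single character or a pair (off, length) meaning
    "copy `length` characters starting `off` characters back".
    """
    out = []
    for item in items:
        if isinstance(item, str):
            out.append(item)
        else:
            off, length = item
            # Copy one character at a time so that a pointer may overlap
            # the text it is producing (off < length).
            for _ in range(length):
                out.append(out[-off])
    return "".join(out)


print(lzss_decode(["a", "b", (2, 4), "c"]))
```

The character-by-character copy is what allows a pointer to refer to text that is still being written, a standard property of LZ77-style schemes.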

The basic idea of parallel coding is partitioning the input file of size N into n 
blocks of size N/n and assigning each block to one of the n available processors. 
For static methods the encoding is then straightforward, but for the decoding, 
it is the compressed file that is partitioned into equi-sized blocks, so there might 
be a problem of synchronization at the block boundaries. This problem may be 
overcome by inserting dummy bits to align the block boundaries with codeword 
boundaries, which causes a negligible overhead if the block size is large enough. 
Alternatively, in the case of static Huffman codes, one may exploit their tendency 
to resynchronize quickly after an error, to devise a parallel decoding procedure 
in which each processor decodes one block, but is allowed to overflow into one 
or more following blocks until synchronization is reached.

For dynamic methods one is faced with the additional problem that the 
encoding and decoding of elements in the i-th block may depend on elements 
of some previous blocks. Even if one assumes a CREW architecture, in which 
all the processors share some common memory space which can be accessed in 
parallel, this would still be essentially equivalent to a sequential model. This is 
so because elements dealt with by processor i at the beginning of block i may 
rely upon elements at the end of block i — 1 which have not been processed yet 
by processor i — 1; thus processor i can in fact start its work only after processor 
i — 1 has terminated its own. 

The easiest way to implement parallelization in spite of the above problem 
is to let each processor work independently of the others. The file is thus par- 
titioned into n blocks which are encoded and decoded without any transfer of 
data between the processors. If the block size is large enough, this solution may 
even be recommendable: most LZ methods put a bound on the size of the history 
taken into account for the current item, and empirical tests show that the addi- 
tional compression, obtained by increasing this history beyond some reasonable 
size, rapidly tends to zero. The cost of parallelization would therefore be a small 
deterioration in compression performance at the block boundaries, since each
processor has to “learn” the main features of the file on its own, but this loss
will often be tolerated as it may cut the processing time by a factor
of n. It has also been suggested to let each processor keep the last characters
of the previous block and thereby improve the encoding speed, but each block
must then be larger than the size of the history window. On the other hand, 
putting a lower bound on the size N/n of each block effectively puts an upper 
bound on the number of processors n which can be used for a given file of size 
N , so we might not fully take advantage of all the available computing power. 

We therefore turn to the question how to use n processors, even when the 
size of each block is not very large. In the next section we propose a new parallel 
coding algorithm, based on a time versus compression efficiency tradeoff which is 
related to the degree of parallelization. On the one extreme, for full paralleliza- 
tion, each of the n processors works independently, which may sharply reduce 
the compression gain if the size of the blocks is small. On the other extreme, 
all the processors may communicate, forcing delays that make this variant as 
time consuming as a sequential algorithm. The suggested tradeoff is based on a 
hierarchical structure of the connections between the processors, each of which
depends on at most log n others. The task can be performed in parallel by n
processors in log n sequential stages. There will be a deterioration in the com-
pression ratio, but the loss will be smaller than that incurred when all n processors
are independent.

In contrast to Huffman coding, for which parallel decoding could be applied 
regardless of whether the possibility of having multiple processors at decoding time
was known at the time of encoding, there is a closer connection between encoding 
and decoding for LZ schemes. We therefore need to deal also with the parallel 
encoding scheme, and we assume that the same number of processors is available 
for both tasks. 

Note, however, that one cannot assume simultaneously equi-sized blocks for 
both encoding and decoding. If encoding is done with blocks of fixed size, the 
resulting compressed blocks are of variable lengths. So one either has to store 
a vector of indices to the starting point of each processor in the compressed 
file, which adds an unnecessary storage overhead, or one performs a priori the 
compression on blocks of varying size, such that the resulting compressed blocks 
are all of roughly the same size. To get blocks of exactly the same size and to 
achieve byte alignment, one then needs to pad each block with a small number 
of bits, but in this case the loss of compression due to this padding is generally 
negligible. Moreover, the second alternative is also the preferred choice for many 
specific applications. For instance, in an Information Retrieval system built on 
a large static database, compression is done only once, so the speedup of par- 
allelization may not have any impact, whereas decompression of selected parts 
is required for each query to be processed, raising the importance of parallel 
decoding.

2 A Tree-Structured Hierarchy of Processors

The suggested form of the hierarchy is that of a full binary tree, similar to a
binary heap. This basic form has already been mentioned in earlier work, but the way
to use it as presented here is new. The input file is partitioned into $n$ blocks
$B_1, \dots, B_n$, each of which is assigned to one of the available processors. Denote
the $n$ processors by $P_1, \dots, P_n$, and assume, for ease of description, that
$n + 1$ is a power of 2, that is, $n = 2^k - 1$ for some $k$. Processor $P_1$ is at the
root of the tree and deals with the first block. As there is no need to “point into
the future”, communication lines between the processors may be unidirectional,
permitting a processor with higher index to access processors with lower index,
but not vice versa. Restricting this to a tree layout yields a structure in which
$P_{2i}$ and $P_{2i+1}$ can access the memory of $P_i$, for $1 \le i \le (n - 1)/2$. Figure 1
shows this layout for $n = 15$, the arrows indicating the dependencies between
the processors. The numbers indicate both the indices of the blocks and of the
corresponding processors.
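
The heap-like numbering makes the dependencies easy to compute. The Python sketch below lists the children of a block and the chain of ancestors whose memory it may read; the function names are ours and the code is only an illustration of the layout, assuming blocks are numbered from 1 as in the text.

```python
def children(i, n):
    """Blocks that may start once block i is done (heap numbering, 1-based)."""
    return [c for c in (2 * i, 2 * i + 1) if c <= n]


def ancestors(i):
    """Blocks that block i may refer back to: parent, grandparent, ..., root."""
    chain = []
    i //= 2
    while i >= 1:
        chain.append(i)
        i //= 2
    return chain


n = 15
print(children(1, n), children(7, n), children(8, n))
print(ancestors(13))
```

Every block has at most about log2(n) ancestors, which is the source of the bound on the number of processors a given processor depends on.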




The compression procedure for LZSS works as follows: $P_1$ starts at the
beginning of block $B_1$, which is stored in its memory. Once this is done, $P_2$ and $P_3$
start their work simultaneously on $B_2$ and $B_3$ respectively, both searching for
reoccurring strings first within the block they have been assigned to, and then
extending the search back into block $B_1$. In general, after $P_i$ has finished the
processing of block $B_i$, processors $P_{2i}$ and $P_{2i+1}$ start scanning simultaneously
their corresponding blocks. The compression of the file is thus not necessarily
done layer by layer, e.g., $P_{12}$ and $P_{13}$ may start compressing blocks $B_{12}$ and $B_{13}$,
even if $P_5$ is not yet done with $B_5$.

Note that while the blocks $B_2$ and $B_1$ are contiguous, this is not the case for
$B_3$ and $B_1$, so that the (off, len) pairs do not necessarily point to close previous
occurrences of a given string. This might affect compression efficiency, as one of
the reasons for the good performance of LZ methods is the tendency of many
files to repeat certain strings within the close vicinity of their initial occurrences.
For processors and blocks with higher indices, the problem is aggravated even further.
The experimental section below gives empirical estimates of the resulting loss.

The layout suggested in Figure 1 is obviously wasteful, as processors of the 
higher layers stay idle after having compressed their assigned block. The number
of necessary processors can be reduced by half, or, which is equivalent, the block
size for a given number of processors may be doubled, if one allows a processor to
deal with multiple blocks. The easiest way to achieve this is displayed in Figure 2,
where the numbers in the nodes are the indices of the blocks, and the boldface
numbers near the nodes refer to the processors. Processors $1, \dots, 2^m$ are assigned
sequentially, from left to right, to the blocks of layer $m$, $m = 0, 1, \dots, k - 1$. This
simple way of enumerating the blocks has, however, two major drawbacks: refer,
e.g., to block $B_9$ which should be compressed by processor $P_2$. First, it might
be that $P_1$ finishes the compression of blocks $B_2$ and $B_4$ before $P_2$ is done
with $B_3$. This causes an unnecessary delay, $B_9$ having to wait until $P_2$ processes
both $B_3$ and $B_5$, which could be avoided if another processor had been
assigned to $B_9$, for example one of those that has not been used in the upper
layers. Moreover, the problem is not only one of wasted time: $P_2$ stores in its
memory information about the blocks it has processed, namely $B_3$ and $B_5$. But
the compression of $B_9$ does not depend on these blocks, but only on $B_4$, $B_2$
and $B_1$. The problem thus is that the hierarchical structure of the tree is not
inherited by the dependencies between the processors.

To correct this deficiency of the assignment scheme, each processor will
continue working on one of the offspring of its current block. For example, one
could consistently assign a processor to the left child block of the current block,
whereas the right child block is assigned to the next available newly used
processor. More formally, let $S^i_j$ be the index of the processor assigned to block $j$ of
layer $i$, where $i = 0, \dots, k - 1$ and $j = 1, \dots, 2^i$; then $S^0_1 = 1$ and for $i > 0$,

$$S^i_{2j-1} = S^{i-1}_j \quad\text{and}\quad S^i_{2j} = 2^{i-1} + j.$$

The first layers are thus processed, from left to right, by processors with indices: 
(1), (1,2), (1, 3, 2, 4), (1, 5, 3, 6, 2, 7, 4, 8), etc. Figure 3(a) depicts the new 
layout of the blocks, the rectangles indicating the sets of blocks processed by 
the same processor. This structure induces a corresponding tree of processors, 
depicted in Figure 3(b). 
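
The processor sequences listed above can be reproduced mechanically from the rule that a left child inherits the processor of its parent and a right child receives a new one. The Python sketch below does this layer by layer; the function name `layer_assignment` is ours, and the code is an illustration of the rule rather than part of the coding procedure.

```python
def layer_assignment(k):
    """Return, for layers 0..k-1, the processor index of each block, left to right."""
    layers = [[1]]                        # the root block is handled by processor 1
    for i in range(1, k):
        prev, cur = layers[-1], []
        for j, proc in enumerate(prev, start=1):
            cur.append(proc)              # left child keeps the parent's processor
            cur.append(2 ** (i - 1) + j)  # right child gets a newly used processor
        layers.append(cur)
    return layers


for layer in layer_assignment(4):
    print(layer)
```

Each layer uses exactly as many processors as it has blocks, so the deepest layer determines the total number of processors needed.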




Figure 3: New hierarchical structure.

As a result of this method, processor $P_i$ will start its work with block $B_{2i-1}$
and then continue with $B_{4i-2}$, $B_{8i-4}$, etc. In each layer, the evenly indexed blocks
inherit their processors from their parent block, and each of the oddly indexed
blocks starts a new sequence of blocks with processors that have not been used
before.

The memory requirements of the processors have also increased with this new
scheme, and space for the data of up to $\log_2 n$ blocks has to be provided. However,
most of the processors deal only with a few blocks, and the average number of
blocks to be memorized, when amortized over the processors, is less than 2, since
the $n$ blocks are shared among $(n + 1)/2$ processors.

For the encoding and decoding procedures, we need a fast way to convert the
index of a block into the index of the corresponding processor, i.e., a function $f$,
such that $f(i) = j$ if block $B_i$ is coded by processor $P_j$. Define $r(i)$ as the exponent
of the largest power of 2 that divides the integer $i$, that is, $r(i)$ is the length of the longest
suffix consisting only of zeros of the binary representation of $i$.
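
Both $r(i)$ and the mapping from blocks to processors are cheap to compute with bit operations. The Python sketch below implements $r$, a closed form for $f$ obtained by stripping the trailing zero bits of $i$, and a recursive version that follows the assignment rule directly (an oddly indexed block $i$ starts processor $(i+1)/2$, an evenly indexed block inherits from its parent). The function names are ours and the closed form is stated here as our reading of the rule.

```python
def r(i):
    """Number of trailing zero bits in the binary representation of i (i >= 1)."""
    return (i & -i).bit_length() - 1


def f(i):
    """Processor index of block i: 1/2 + i / 2**(r(i) + 1), in integer arithmetic."""
    return ((i >> r(i)) + 1) // 2


def f_recursive(i):
    """Same mapping, following the assignment rule literally."""
    return (i + 1) // 2 if i % 2 else f_recursive(i // 2)


print([f(i) for i in range(1, 16)])
```

The blocks 8 to 15 of the fourth layer map to processors 1, 5, 3, 6, 2, 7, 4, 8, the same sequence that the text lists for that layer.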

Claim: $f(i) = \frac{1}{2} + \frac{i}{2^{r(i)+1}}$.

Proof: By induction on $i$. For $i = 1$, we get $f(1) = 1$, which is correct. Assume
the claim is true up to $i - 1$. If $i$ is odd, $r(i) = 0$ and the formula gives $f(i) =
(i + 1)/2$. As has been mentioned above, any oddly indexed block is the starting
point of a new processor and indeed processor $P_{(i+1)/2}$ starts at block $B_i$. If $i$ is
even, block $B_i$ is coded by the same processor as its parent block $B_{i/2}$, for which
the inductive assumption applies, and since $r(i/2) = r(i) - 1$ we get

$$f(i) = f(i/2) = \frac{1}{2} + \frac{i/2}{2^{r(i/2)+1}} = \frac{1}{2} + \frac{i}{2^{r(i)+1}},$$

so that the formula holds also for $i$.



2.1 Parallel Coding for LZSS 

We now turn to the implementation details of the encoding and decoding
procedures for LZSS. Since the coding is done by stages, each parallel co-routine will
itself invoke the routines for its dependent offspring. For the encoding, the procedure
PLZSS-encode(i, j) given in Figure 4 will process block $B_i$ with processor $P_j$,
where $j = f(i)$. The whole process is initialized by a call to PLZSS-encode(1, 1)
from the main program.

Each routine starts by copying the text of the current block into the memory 
of the processor, possibly appending it to the texts of previous blocks that have been
stored there. As in the original LZSS, the longest substring in the history is 
sought that matches the suffix of the block starting at the current position. The 
search for this substring can be accelerated by several techniques, and one of the 
fastest is by use of a hash table. The longest substring is then replaced by
a pair (offset, length), unless length is too small (2 or 3 in some implementations,
such as the patent which is the basis of Microsoft’s DoubleSpace), in
which case a single character is sent to output and the window is shifted by one.

PLZSS-encode(i, j)
{
    append text of B_i to memory of P_j
    cur <- 1
    while cur <= |B_i|
    {
        S <- suffix of B_i starting at cur
        ind <- i
        while ind > 0
        {
            access memory of P_f(ind) and
                record occurrences in B_ind matching a prefix of S
            ind <- floor(ind / 2)
        }
        if longest occurrence not long enough
            { encode single character;  cur <- cur + 1 }
        else
            { encode as (off, len);  cur <- cur + len }
    }
    perform in parallel
    {
        if 2i <= n      PLZSS-encode(2i, j)
        if 2i + 1 <= n  PLZSS-encode(2i + 1, i + 1)
    }
}

Figure 4: Parallel LZSS encoding for block $B_i$ by processor $P_j$.

In our case, the search is not limited to the current block, but extends
backwards to the parent blocks in the hierarchy, up to the root. For example, referring
to Figure 3, the encoding of block $B_{13}$ will search also through $B_6$, $B_3$ and $B_1$,
and thus access the memory of the processors $P_7$, $P_2$, $P_2$ and $P_1$, respectively.
Note that the size of the history window is usually limited by some constant $W$.
We do not impose any such limit, but in fact, the encoding of any element is
based on a history of size at most $\log_2 n$ times the block size.
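
The effect of searching through the ancestor blocks can be simulated sequentially. The Python sketch below treats the ancestors of a block as one contiguous history placed in front of the block, encodes the block with a naive longest-match search, and decodes it again. It is a simplified illustration under our own assumptions (brute-force search instead of a hash table, items kept as Python objects, blocks held in a dictionary, hypothetical function names), not the parallel procedure of Figure 4.

```python
def ancestor_chain(i):
    """Block indices on the path from the root (block 1) down to block i."""
    chain = []
    while i > 0:
        chain.append(i)
        i //= 2
    return chain[::-1]


def encode_block(blocks, i, min_len=3):
    """Encode blocks[i], allowing pointers into the blocks of its ancestors."""
    history = "".join(blocks[a] for a in ancestor_chain(i)[:-1])
    buf = history + blocks[i]
    items, cur = [], len(history)
    while cur < len(buf):
        best_len, best_off = 0, 0
        for start in range(cur):                      # naive longest-match search
            length = 0
            while (cur + length < len(buf)
                   and buf[start + length] == buf[cur + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, cur - start
        if best_len < min_len:
            items.append(buf[cur])                    # single character
            cur += 1
        else:
            items.append((best_off, best_len))        # (off, len) pointer
            cur += best_len
    return items


def decode_block(decoded, i, items):
    """Rebuild block i from its items and the already decoded ancestor blocks."""
    history = "".join(decoded[a] for a in ancestor_chain(i)[:-1])
    buf = list(history)
    for item in items:
        if isinstance(item, str):
            buf.append(item)
        else:
            off, length = item
            for _ in range(length):
                buf.append(buf[-off])
    return "".join(buf[len(history):])


blocks = {1: "abracadabra ", 2: "abracadabra!", 3: "cadabra abra"}
encoded = {i: encode_block(blocks, i) for i in blocks}
decoded = {}
for i in sorted(encoded):                             # parents before children
    decoded[i] = decode_block(decoded, i, encoded[i])
print(encoded[2])
```

In this toy input the second block is almost entirely covered by a single pointer into the root block, which is the kind of gain that independent blocks would lose.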

For the decoding, recall that we assume that the encoded blocks are of equal 
size Blocksize. The decoding routine can thus address earlier locations as if the 
blocks, that are ancestors of the current block in the tree layout, were stored 
contiguously. Any element of the form (off, len) in block $B_i$ can point back into
a block $B_j$, with $j = \lfloor i/2^b \rfloor$ for $b = 0, 1, \dots, \lfloor \log_2 i \rfloor$, and the index of this block
can be calculated by

$$b \leftarrow \lceil (\mathit{off} - \mathit{cur} + 1) / \mathit{Blocksize} \rceil,$$

where cur is the index of the current position in block $B_i$. The formal decoding
procedure is given in Figure 5.

PLZSS-decode(i, j)
{
    cur <- 1
    while there are more items to decode
    {
        if next item is a character
            { store the character;  cur <- cur + 1 }
        else    // the item is (off, len)
        {
            if off < cur    // pointer within block B_i
                copy len characters, starting at position cur - off
            else            // pointer to earlier block
            {
                b <- ceil((off - cur + 1) / Blocksize)
                t <- (off - cur) mod Blocksize
                copy len characters, starting at position t
                    in block B_floor(i / 2^b), which is stored in P_f(floor(i / 2^b))
            }
            cur <- cur + len
        }
    }
    perform in parallel
    {
        if 2i <= n      PLZSS-decode(2i, j)
        if 2i + 1 <= n  PLZSS-decode(2i + 1, i + 1)
    }
}

Figure 5: Parallel LZSS decoding for block $B_i$ on processor $P_j$.



The input of the decoding routine is assumed to be a file consisting of a
sequence of items, each being either a single character or a pointer of the form
(off, len); cur is the current index in the reconstructed text file.
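The block-index calculation used by the decoder can be illustrated with a short Python sketch. It only evaluates the two formulas for b and t quoted above; the function name `locate_pointer` and the returned tuple are illustrative assumptions, and the interpretation of t as a position inside the ancestor block follows Figure 5 without further detail.

```python
def locate_pointer(i, cur, off, blocksize):
    """Resolve a pointer (off, len) met at position cur of block B_i.

    Returns (block index, b, position), where b is the number of levels
    climbed in the tree and position is cur - off inside B_i, or
    t = (off - cur) mod blocksize inside the ancestor block.
    """
    if off < cur:
        # Pointer stays within the current block B_i.
        return i, 0, cur - off
    # Integer form of ceil((off - cur + 1) / blocksize).
    b = (off - cur + blocksize) // blocksize
    t = (off - cur) % blocksize
    # The ancestor b levels up has index floor(i / 2^b).
    return i >> b, b, t


print(locate_pointer(13, 10, 4, 100))    # pointer inside B_13
print(locate_pointer(13, 10, 10, 100))   # reaches the parent B_6
print(locate_pointer(13, 10, 150, 100))  # reaches the grandparent B_3
```

The right shift `i >> b` is just the binary-tree parent relation applied b times, which is why no explicit list of ancestors is needed here.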



2.2 Parallel Coding for LZW 

Encoding and decoding for LZW is similar to that of LZSS, with a few differences.
While for LZSS, the “dictionary” of previously encountered strings is in fact the
text itself, LZW builds a continuously growing table Table, which need not be
transmitted, as it is synchronously reconstructed by the decoder. The table is
initialized to include the set of single characters composing the text, which is
often assumed to be ASCII. If, as above, we denote by $S$ the suffix of the text in
block $B_i$ starting at the current position, then the next encoded element will be
the index of the longest prefix $R$ of $S$ for which $R \in$ Table, and the next element
to be adjoined to Table will be the shortest prefix $R'$ of $S$ for which $R' \notin$ Table;
$R$ is a prefix of $R'$ and $R'$ extends $R$ by one additional character.

During the encoding process of $B_i$, one therefore needs to access the tables
in $B_i$ itself and in the blocks which are ancestors of $B_i$ in the tree layout, but
the order of access has to be top down rather than bottom up as for LZSS. For
each $i$, we therefore need a list $list_i$ of the indices of the blocks accessed on the
way from the root to block $B_i$, that is, $list_i[ind]$ is the number whose binary
representation is given by the ind leftmost bits of the binary representation of
$i$. For example, $list_{13} = [1, 3, 6, 13]$.
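A minimal Python sketch of the list construction follows. The function `ancestor_list` implements the binary-prefix definition given above. The function `processor_of` is an assumption: it encodes the mapping f(2i) = f(i), f(2i + 1) = i + 1 that can be read off the recursive calls in the figures, and is not stated as a formula in the text.

```python
def ancestor_list(i):
    """Return list_i: block indices on the path from the root B_1 to B_i.

    Entry ind (1-based) is the number formed by the ind leftmost bits of i.
    """
    bits = bin(i)[2:]
    return [int(bits[:k], 2) for k in range(1, len(bits) + 1)]


def processor_of(i):
    """Assumed processor index f(i): f(2i) = f(i), f(2i + 1) = i + 1."""
    while i % 2 == 0:
        i //= 2          # an even block shares the processor of its parent
    return (i + 1) // 2  # an odd block 2h + 1 is handled by processor h + 1


print(ancestor_list(13))                             # [1, 3, 6, 13]
print([processor_of(b) for b in ancestor_list(13)])  # [1, 2, 2, 7]
```

Under this assumed mapping a layout of 7 blocks uses processors 1 to 4, one per odd-numbered block.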

To encode a new element $P$, it is first searched for in Table of $B_1$, and if
not found there, then in Table of $B_{list_i[2]}$, which is stored in the memory of
processor $P_{f(list_i[2])}$, etc. However, storing only the elements in the tables may
lead to errors. To illustrate this, consider the following example, referring again
to Figure 3.

Suppose that the longest prefix of the string abcde appearing in the Table
of $B_1$ is abc. Suppose we later encounter abcd in the text of block $B_2$. The
string abcd will thus be adjoined to the same Table, since both $B_1$ and $B_2$ are
processed by the same processor $P_1$. Assume now that the texts of both blocks
$B_5$ and $B_3$ start with abcde. While for $B_5$ it is correct to store abcde as the first
element in its Table, the first element to be stored in the Table of $B_3$ should be
abcd, since the abcd in the memory of $P_1$ was generated by block $B_2$, whereas
$B_3$ only depends on $B_1$.



PLZW-encode(i, j)
{
    ω ← B_i[1]
    cur ← 2
    while cur < |B_i|
    {
        ind ← 1
        while list_i[ind] < i
        {
            while cur < |B_i| and
                  (ωB_i[cur], ind) ∈ Table stored in P_{f(list_i[ind])}
            {
                ω ← ωB_i[cur]
                cur ← cur + 1
                last ← ind
            }
            ind ← ind + 1
        }
        indx ← index(ω) in Table of P_{f(list_i[last])}
        store (indx, last) in memory of P_j
        store (ωB_i[cur], ind) in Table in memory of P_j
        ω ← B_i[cur]
        cur ← cur + 1
    }

    perform in parallel
    {
        if 2i < n        PLZW-encode(2i, j)
        if 2i + 1 < n    PLZW-encode(2i + 1, i + 1)
    }
}

Figure 6: Parallel LZW encoding for block Bi on processor Pj.
To avoid such errors, we need a kind of “time stamp”, indicating at what
stage an element has been added to a Table. If the elements are stored
sequentially in these tables, one only needs to record the indices of the last element for
each block. But implementations of LZW generally use hashing to maintain the
tables, so one cannot rely on deducing information from its physical location,
and each element has to be marked individually. The easiest way is to store with
each string $P$ also the index $i$ of the block which caused the addition of $P$. This
would require $\log_2 n$ bits for each entry. One can however take advantage of the
fact that the elements stored by different blocks $B_i$ in the memory of a given
processor correspond to different indices ind in the corresponding lists $list_i$. It
thus suffices to store with each element the index in $list_i$ rather than $i$ itself, so
that only $\log_2 \log_2 n$ bits are needed for each entry. The formal encoding and
decoding procedures are given in Figures 6 and 7, respectively.

The parallel LZW encoding refers to the characters in the input block as
belonging to a vector $B_i[cur]$, with cur giving the current index. If $x$ and $y$ are
strings, then $xy$ denotes their concatenation. As explained above, since the Table
corresponding to block $B_i$ is stored in the memory of a processor which is also
accessed by other blocks, each element stored in the Table needs an identifier
indicating the block from which it has been generated. The elements in the Table
are therefore of the form (string, identifier).
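The effect of the identifier can be sketched in Python on the abcd example discussed earlier. This is only an illustration under stated assumptions: each processor's Table is modelled as a set of (string, identifier) pairs, the helper names are hypothetical, and the processor mapping f(2i) = f(i), f(2i + 1) = i + 1 is inferred from the recursive calls in the figures.

```python
def ancestor_list(i):
    """list_i: indices of the blocks on the path from the root to B_i."""
    bits = bin(i)[2:]
    return [int(bits[:k], 2) for k in range(1, len(bits) + 1)]


def processor_of(i):
    """Assumed mapping: f(2i) = f(i), f(2i + 1) = i + 1."""
    while i % 2 == 0:
        i //= 2
    return (i + 1) // 2


def visible(tables, i, string):
    """True if string was added by B_i or by one of its ancestors.

    tables maps a processor index to its set of (string, identifier)
    pairs; the identifier is the position ind in list_i of the block
    that generated the entry.
    """
    for ind, block in enumerate(ancestor_list(i), start=1):
        if (string, ind) in tables.get(processor_of(block), set()):
            return True
    return False


# P_1 holds abc (generated by B_1, ind = 1) and abcd (generated by B_2, ind = 2).
tables = {1: {("abc", 1), ("abcd", 2)}}
print(visible(tables, 5, "abcd"))  # B_5 descends from B_2
print(visible(tables, 3, "abcd"))  # B_3 depends only on B_1
```

Without the identifier, a lookup for B_3 in the memory of P_1 would wrongly find abcd; with it, B_3 sees only abc.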

The output of LZW encoding is a sequence of pointers, which are the
indices of the encoded elements in the Table. In our case, these pointers are of
the form (index, identifier). There is, however, no deterioration in the
compression efficiency, as the additional bits needed for the identifier are saved in the
representation of the index, which addresses a smaller range.

For simplicity, we do not go into details of handling the incremental encoding 
of the indices, and overflow conditions when the Table gets full. It can be done 
as for the serial LZW. 

The parallel LZW decode routine assumes that its input is a sequence of
elements of the form (index, identifier). The empty string is denoted by Λ.

The algorithm in Figure 7 is a simplified version of the decoding, which does 
not work in case the current element to be decoded was the last one to be added 
to the Table. This is also a problem in the original LZW decoding and can be 
solved here in the same way. The details have been omitted to keep the emphasis 
on the parallelization. 

3 Experimental Results 

We now report on some experiments on files in different languages: the Bible 
(King James Version) in English, the Bible in Hebrew and the Dictionnaire 
philosophique of Voltaire in French. Table 1 first gives the sizes of the files in MB
and the sizes to which they can be reduced by LZSS and LZW, expressed in percent
of the sizes of the original files. We consider three algorithms: the serial one, 
using a single processor and yielding the compressed sizes in Table 1, but being 
slow; a parallel algorithm we refer to as standard, where each block is treated 
PLZW-decode(i, j)
{
    cur ← 1
    old ← Λ
    while cur < number of items in block B_i
    {
        (indx, ind) ← B_i[cur]
        access Table in P_{f(list_i[ind])} at index indx
            and send string str found there to output
        if old ≠ Λ
            store (old first[str], ⌈log_2(i + 1)⌉) in Table of P_j
        old ← str
        cur ← cur + 1
    }

    perform in parallel
    {
        if 2i < n        PLZW-decode(2i, j)
        if 2i + 1 < n    PLZW-decode(2i + 1, i + 1)
    }
}

Figure 7: Parallel LZW decoding for block Bi on processor Pj.



independently of the others; and the new parallel algorithm presented herein, 
which exploits the hierarchical layout. The columns headed Time in Table 1 
compare the new algorithm with the serial one. The time measurements were 
taken on a Sun 450 with four UltraSPARC-II 248 MHz processors, which allowed 
a layout with 7 blocks. The values are in seconds and correspond to LZW, which 
turned out to give better compression performance than LZSS in our case. The 
improvement is obviously not expected to be 4-fold, due to the overhead of the 
parallelization, but on the examples the time is generally cut to less than half. 



Table 1: Size and time measurements on test files.

                  Size                          Time (seconds)
                  Full    compressed by (%)     compression       decompression
                  (MB)    LZSS     LZW          Serial   New      Serial   New

English Bible     3.860   41.6     36.6         5.508    2.296    3.653    1.504
Hebrew Bible      1.471   51.7     44.7         2.134    0.853    1.488    0.566
Voltaire          0.529   49.0     40.6         0.770    0.380    0.456    0.310
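The time ratios implied by Table 1 can be worked out with a few lines of Python. The timings are copied from the table; the dictionary layout and the helper name `speedup` are illustrative only.

```python
# (serial, new) running times in seconds, copied from Table 1.
times = {
    "English Bible": {"compression": (5.508, 2.296), "decompression": (3.653, 1.504)},
    "Hebrew Bible": {"compression": (2.134, 0.853), "decompression": (1.488, 0.566)},
    "Voltaire": {"compression": (0.770, 0.380), "decompression": (0.456, 0.310)},
}


def speedup(serial, new):
    """Ratio of the serial time to the time of the new parallel algorithm."""
    return serial / new


ratios = {
    (name, phase): speedup(*pair)
    for name, phases in times.items()
    for phase, pair in phases.items()
}

for (name, phase), ratio in ratios.items():
    print(f"{name:14s} {phase:14s} {ratio:.2f}x")
```

On these figures the ratio lies between roughly 1.5 and 2.6, well below the number of processors; the smallest value is the decompression of the Voltaire file.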



For the compression performance, we compare the two parallel versions. Both 
are equivalent to the serial algorithm if the block size is chosen large enough, 
as in Q. The graphs in Figure 8 show the sizes of the compressed files in MB 
as functions of the block size (in bytes), for both LZSS and LZW. We see that 
for large enough blocks (about 64K for LZSS and 128K for LZW) the loss rel- 
ative to a serial algorithm with a single processor is negligible (about 1%) for 
both the standard and the new methods. However, when the blocks become 
shorter, the compression gain in the independent model almost vanishes, whereas
with the new processor layout the decrease in compression performance is much 
slower. For blocks as small as 128 bytes, running a standard parallel compression 
achieves only about 1-4% compression for LZSS and about 12-15% for LZW, 
while with the new layout this might be reduced by some additional 30-40%. 





[Plot: compressed file size in MB against block size (bytes), for LZSS and LZW.]

Figure 8: Size of compressed file as function of block size.



We conclude that the simple hierarchical layout might allow us to consider- 
ably reduce the size of the blocks that are processed in parallel without paying 
too high a price in compression performance. As a consequence, if a large number 
of processors is available, it enables a better utilization of their full combined 
computing power. 
Abstract. We focus on the problem of approximate matching of strings
that have been compressed using run-length encoding. Previous studies
have concentrated on the problem of computing the longest common
subsequence (LCS) between two strings of length $m$ and $n$, compressed
to $m'$ and $n'$ runs. We extend an existing algorithm for the LCS to the
Levenshtein distance achieving $O(m'n + n'm)$ complexity. This approach
also gives an algorithm for approximate searching of a pattern of $m$ letters
($m'$ runs) in a text of $n$ letters ($n'$ runs) in $O(mm'n')$ time, both for LCS
and Levenshtein models. Then we propose improvements for a greedy
algorithm for the LCS, and conjecture that the improved algorithm has
$O(m'n')$ expected case complexity. Experimental results are provided to
support the conjecture.



1 Introduction 

The problem of compressed pattern matching is, given a compressed text T and 
a (possibly compressed) pattern P, find all occurrences of P in T without de- 
compressing T (and P). The goal is to search faster than by using the basic 
scheme: decompression followed by a search. 

In the basic approach, we are interested in reporting only the exact occur- 
rences, i.e. the locations of the substrings of T that match exactly pattern P. 
We can loosen the requirement of exact occurrences to approximate occurrences 
by introducing a distance function to measure the similarity between P and a 
substring of T. Now, we want to find all the approximate occurrences of P in 
T, where the distance between P and a substring of T is at most a given error 
threshold $k$. Often a suitable distance measure between two strings is the edit
distance, where the minimum number of character insertions, deletions, and
replacements that are needed to make the two strings equal is calculated. For
this distance we are interested in $k < |P|$ errors.
** Supported in part by Fondecyt grants f-990627 and f-000929.

A. Amir and G.M. Landau (Eds.): CPM 2001, LNCS 2089, pp. 31-^^ 2001. 
Many studies have been made around the subject of compressed pattern
matching over different compression formats, starting with the work of Amir and
Benson. The only works addressing the approximate variant of
the problem have been on Ziv-Lempel.

Our focus is approximate matching over run-length encoded strings. In
run-length encoding a string that consists of repetitions of letters is compressed
by encoding each repetition as a pair ("letter", "length of the repetition"). For
example, string aaabbbbccaab is encoded as a sequence (a, 3)(b, 4)(c, 2)(a, 2)(b, 1).
This technique is widely used especially in image compression, where repetitions
of pixel values are common. This is particularly interesting for fax transmissions
and bilevel images. Approximate matching on images can be a useful tool to
detect distortions. Even a one-dimensional compressed approximate matching
algorithm would be useful to speed up existing two-dimensional approximate
matching algorithms.
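The encoding itself is easy to state in code. The following Python sketch, using the standard `itertools.groupby`, reproduces the example above; the function name `run_length_encode` is illustrative.

```python
from itertools import groupby


def run_length_encode(s):
    """Encode s as a list of (letter, run length) pairs, one per maximal run."""
    return [(letter, len(list(run))) for letter, run in groupby(s)]


print(run_length_encode("aaabbbbccaab"))
# [('a', 3), ('b', 4), ('c', 2), ('a', 2), ('b', 1)]
```

Because `groupby` splits only where consecutive letters differ, no two adjacent pairs carry the same letter.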

Exact pattern matching over run-length encoded text can be done optimally
in $O(m' + n')$ time, where $m'$ and $n'$ are the compressed sizes of the pattern and
the text. Approximate pattern matching over run-length encoded text has
not been considered before this study, but there has been work on the distance
calculation, namely, given two strings of length $m$ and $n$ that are run-length
compressed to lengths $m'$ and $n'$, calculate their distance using the compressed
representations of the strings. This problem was first posed by Bunke and Csirik.
They considered the version of edit distance without the replacement
operation, which is related to the problem of calculating the longest common
subsequence (LCS) of two strings. They gave an $O(m'n')$ time algorithm for a special
case of the problem, where all run-lengths are of equal size. Later, they gave
an $O(m'n + n'm)$ time algorithm for the general case. A major
improvement over the previous results was due to Apostolico, Landau, and Skiena:
they first gave a basic $O(m'n'(m' + n'))$ algorithm, and further improved it to
$O(m'n' \log(m'n'))$. Mitchell gave an algorithm with the same time complexity
in the worst case, but faster with
some inputs; its time complexity is $O((p + m' + n') \log(p + m' + n'))$, where
$p$ is the number of pairs of compressed characters that match ($p$ equals
the number of equal letter boxes, see the definition in Sect. 2.2).

All these algorithms were limited to the LCS distance, although Mitchell’s
method could be applied when different costs are assigned to the
insertion and deletion operations. It still remains an open question
(as posed by Bunke and Csirik) whether similar improvements could be found
for a more general set of edit operations and their costs.

We give an algorithm for matching run-length encoded strings under
Levenshtein distance. In the Levenshtein distance a unit cost is assigned to each
of the three edit operations. The algorithm is an extension of the $O(m'n + n'm)$
algorithm of Bunke and Csirik; we keep the same cost but generalize the
algorithm to handle a more complex distance model. Independently from our
work, Arbell, Landau, and Mitchell have found a similar algorithm.
We modify our algorithm to work in a context of approximate pattern
matching, and achieve $O(mm'n')$ time for searching a pattern of length $m$ that is
run-length compressed to length $m'$, in a run-length compressed text of length $n'$.
This algorithm works for both Levenshtein and LCS distance models.

We also study the LCS calculation. First, we give a greedy algorithm for
the LCS that works in $O(m'n'(m' + n'))$ time. Adapting the well known
diagonal method, we are able to improve the greedy method to work in
$O(d^2 \min(n', m'))$ time, where $d$ is the edit distance between the two strings
(under insertions and deletions with the unit cost model).

Then we present improvements for the greedy method for the LCS, that do
not however affect the worst case, but do have effect on the average case. We end
up conjecturing that our improved algorithm is $O(m'n')$ time on average. As we
are unable to prove it, we provide instead experimental evidence to support the
conjecture.



2 Edit Distance on Run-Length Compressed Strings 

2.1 Edit Distance 

Let $\Sigma$ be a finite set of symbols, called an alphabet. A string $A$ of length $|A| = m$
is a sequence of symbols in $\Sigma$, denoted by $A = A_{1 \ldots m} = a_1 a_2 \ldots a_m$, where $a_i \in \Sigma$
for every $i$. If $|A| = 0$, then $A = \lambda$ is an empty string. A subsequence of $A$ is any
sequence $a_{i_1} a_{i_2} \ldots a_{i_k}$, where $1 \le i_1 < i_2 < \cdots < i_k \le m$.

The edit distance can be used to measure the similarity between two strings
$A = a_1 a_2 \ldots a_m$ and $B = b_1 b_2 \ldots b_n$ by calculating the minimum cost of edit
operations that are needed to convert $A$ into $B$. The usual edit operations
are substitution (convert $a_i$ into $b_j$, denoted by $a_i \to b_j$), insertion ($\lambda \to b_j$),
and deletion ($a_i \to \lambda$). Different costs for edit operations can be given. For
Levenshtein distance (denoted by $D_L(A, B)$) we assign costs $w(a \to a) = 0$,
$w(a \to b) = 1$, $w(a \to \lambda) = 1$, and $w(\lambda \to a) = 1$, for all $a, b \in \Sigma$, $a \ne b$. If
substitutions are forbidden, i.e. $w(a \to b) = \infty$, we get the distance $D_{ID}(A, B)$.

Distance $D_L(A, B)$ can be calculated by using dynamic programming:
evaluate an $(m + 1) \times (n + 1)$ matrix $(d_{ij})$, $0 \le i \le m$, $0 \le j \le n$, using the
recurrence

\begin{align*}
d_{i,0} &= i, \quad 0 \le i \le m, \\
d_{0,j} &= j, \quad 0 \le j \le n, \tag{1} \\
d_{i,j} &= \min(\text{if } a_i = b_j \text{ then } d_{i-1,j-1} \text{ else } d_{i-1,j-1} + 1, \\
        &\qquad\quad d_{i-1,j} + 1, \; d_{i,j-1} + 1), \quad \text{otherwise.}
\end{align*}

The matrix $(d_{ij})$ can be evaluated row-by-row or column-by-column in $O(mn)$
time, and the value $d_{mn}$ equals $D_L(A, B)$.
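Recurrence (1) translates directly into a row-by-row evaluation. The Python sketch below is the plain uncompressed computation, given only as a reference point; the function name `levenshtein` is illustrative, and only two rows are kept in memory at a time.

```python
def levenshtein(a, b):
    """Evaluate recurrence (1) row by row and return d_{m,n} = D_L(a, b)."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))            # row 0: d_{0,j} = j
    for i in range(1, m + 1):
        cur = [i] + [0] * n              # column 0: d_{i,0} = i
        for j in range(1, n + 1):
            diagonal = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else 1)
            cur[j] = min(diagonal, prev[j] + 1, cur[j - 1] + 1)
        prev = cur
    return prev[n]


print(levenshtein("kitten", "sitting"))  # 3
```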
A similar method can be used to calculate the distance $D_{ID}(A, B)$. Now, the
recurrence is

\begin{align*}
d_{i,0} &= i, \quad 0 \le i \le m, \\
d_{0,j} &= j, \quad 0 \le j \le n, \tag{2} \\
d_{i,j} &= \min(\text{if } a_i = b_j \text{ then } d_{i-1,j-1} \text{ else } \infty, \\
        &\qquad\quad d_{i-1,j} + 1, \; d_{i,j-1} + 1), \quad \text{otherwise.}
\end{align*}

The problem of calculating the longest common subsequence of strings $A$ and
$B$ (denoted by $LCS(A, B)$) is related to the distance $D_{ID}(A, B)$. It is easy to
see that $2 \cdot |LCS(A, B)| = m + n - D_{ID}(A, B)$.
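The relation between the two quantities can be checked numerically. The sketch below evaluates recurrence (2) and, independently, a textbook LCS-length recurrence; both function names are illustrative and neither routine uses the compressed representation.

```python
import math


def indel_distance(a, b):
    """Evaluate recurrence (2): edit distance with insertions and deletions only."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            diagonal = prev[j - 1] if a[i - 1] == b[j - 1] else math.inf
            cur[j] = min(diagonal, prev[j] + 1, cur[j - 1] + 1)
        prev = cur
    return prev[n]


def lcs_length(a, b):
    """Length of a longest common subsequence, by the standard recurrence."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


a, b = "ABCBDAB", "BDCABA"
print(indel_distance(a, b), lcs_length(a, b))
print(2 * lcs_length(a, b) == len(a) + len(b) - indel_distance(a, b))
```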



2.2 Dividing the Edit Distance Matrix into Boxes 

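A run-length encoding of the string $A = a_1 a_2 \ldots a_m$ is
$A' = (a_1, p_1)(a_{p_1+1}, p_2)(a_{p_1+p_2+1}, p_3) \ldots (a_{m-p_{m'}+1}, p_{m'}) = (a_{i_1}, p_1)(a_{i_2}, p_2) \ldots (a_{i_{m'}}, p_{m'})$, where
$(a_{i_k}, p_k)$ denotes a sequence $\alpha_k = a_{i_k} a_{i_k+1} \ldots a_{i_k+p_k-1} = a_{i_k}^{p_k}$ of length $|\alpha_k| = p_k$.
We also call $(a_{i_k}, p_k)$ a run of $a_{i_k}$. String $A$ is optimally run-length encoded if
$a_{i_k} \ne a_{i_{k+1}}$ for all $1 \le k < m'$.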

In the next sections, we will show how to speed up the evaluation of values
$d_{mn}$ for both distances $D_L(A, B)$ and $D_{ID}(A, B)$ when both the strings $A$ and
$B$ are run-length encoded. In both methods, we use the following notation to
divide the matrix $(d_{ij})$ into submatrices (see Fig. 1).



[Figure: DP matrix for run-length encoded strings, one of them aaaabbbbbbcccccbb.
Legend: overlapping borders of boxes; corners; one particular "box";
equal letter box; different letter box.]

Fig. 1. A dynamic programming matrix split into run-length blocks.
Let $A' = (a_{i_1}, p_1)(a_{i_2}, p_2) \ldots (a_{i_{m'}}, p_{m'})$ and $B' = (b_{j_1}, r_1)(b_{j_2}, r_2) \ldots
(b_{j_{n'}}, r_{n'})$ be the run-length encoded representations of strings $A$ and $B$. The
rows and columns that correspond to the ends of runs in $A$ and $B$ separate the
edit distance matrix $(d_{ij})$ into submatrices. To ease the notation later on, we
define the submatrices so that they overlap on the borders. Formally, each pair
of runs $(a_{i_k}, p_k)$, $(b_{j_\ell}, r_\ell)$ defines a $(p_k + 1) \times (r_\ell + 1)$ submatrix $(d^{k,\ell}_{s,t})$ such that

\[ d^{k,\ell}_{s,t} = d_{i_k+s-1,\, j_\ell+t-1}, \quad 0 \le s \le p_k, \; 0 \le t \le r_\ell. \tag{3} \]

We will call submatrices $(d^{k,\ell}_{s,t})$ boxes. If a pair of runs corresponding to a
box contain equal letters (i.e. $a_{i_k} = b_{j_\ell}$), then $(d^{k,\ell}_{s,t})$ is called an equal letter
box. Otherwise we call $(d^{k,\ell}_{s,t})$ a different letter box. Adjacent boxes can form runs
of different letter boxes along rows and columns. We assume that both strings
are optimally run-length encoded, and hence runs of equal letter boxes cannot
occur.

3 An O(mn' + m'n) Algorithm for the Levenshtein
Distance

Bunke and Csirik gave an $O(mn' + m'n)$ time algorithm for computing the
LCS between two strings of lengths $n$ and $m$ run-length compressed to $n'$ and
$m'$. They pose as an open problem extending their algorithm to the Levenshtein
distance. This is what we do in this section, without increasing the complexity
to compute the new distance $D_L$. Arbell, Landau, and Mitchell have
independently found a similar algorithm. Their solution is also based on the same
idea of extending the $O(mn' + m'n)$ LCS algorithm to the Levenshtein distance.

Compared to the LCS-related distance $D_{ID}$, the Levenshtein distance $D_L$
permits an additional character substitution operation, at cost 1. We compute
$D_L(A, B)$ by filling all the borders of all the boxes $(d^{k,\ell}_{s,t})$ (see Fig. 1). We manage
to fill each cell in constant time, which adds up to the promised $O(mn' + m'n)$
complexity. The space complexity can be made $O(n + m)$ by processing the
matrix row-wise or column-wise.


3.1 The Basic Algorithm 

We start with two lemmas that characterize the relationships between the border
values in the boxes $(d^{k,\ell}_{s,t})$. First, we consider the equal letter boxes:

Lemma 1 (Bunke and Csirik) The recurrences (1) and (2) can be
replaced by

\[ d^{k,\ell}_{s,t} = \text{if } s \le t \text{ then } d^{k,\ell}_{0,t-s} \text{ else } d^{k,\ell}_{s-t,0}, \tag{4} \]

where $1 \le s \le p_k$ and $1 \le t \le r_\ell$, for values $d^{k,\ell}_{s,t}$ in an equal letter box. □
Note that Lemma 1 holds for both Levenshtein and LCS distance models,
because recurrences (1) and (2) are equal when $a_i = b_j$. Since we are computing
all the cells in the borders of the boxes, Lemma 1 permits computing new box
borders in constant time using those of previous boxes.

The difficult part lies in the different letter boxes.

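Lemma 2 The recurrence (1) can be replaced by

\[ d^{k,\ell}_{s,t} = 1 + \min\Bigl( t - 1 + \min_{\max(0,s-t) \le q \le s} d^{k,\ell}_{q,0}, \;\;
   s - 1 + \min_{\max(0,t-s) \le q \le t} d^{k,\ell}_{0,q} \Bigr), \tag{5} \]

where $1 \le s \le p_k$ and $1 \le t \le r_\ell$, for values $d^{k,\ell}_{s,t}$ in a different letter box.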

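Proof. We use induction on $s + t$. If $s + t = 2$ the formula (5) becomes
$d^{k,\ell}_{1,1} = 1 + \min(d^{k,\ell}_{0,0}, d^{k,\ell}_{1,0}, d^{k,\ell}_{0,1})$, which matches recurrence (1). In the inductive case we
have

\[ d^{k,\ell}_{s,t} = 1 + \min(d^{k,\ell}_{s-1,t-1}, \; d^{k,\ell}_{s-1,t}, \; d^{k,\ell}_{s,t-1}) \]

by recurrence (1), and using the induction hypothesis we get

\begin{align*}
d^{k,\ell}_{s,t} = 2 + \min\Bigl(
 & \min\bigl(t - 2 + \min_{\max(0,s-t) \le q \le s-1} d^{k,\ell}_{q,0}, \;\; s - 2 + \min_{\max(0,t-s) \le q \le t-1} d^{k,\ell}_{0,q}\bigr), \\
 & \min\bigl(t - 1 + \min_{\max(0,s-1-t) \le q \le s-1} d^{k,\ell}_{q,0}, \;\; s - 2 + \min_{\max(0,t-s+1) \le q \le t} d^{k,\ell}_{0,q}\bigr), \\
 & \min\bigl(t - 2 + \min_{\max(0,s-t+1) \le q \le s} d^{k,\ell}_{q,0}, \;\; s - 1 + \min_{\max(0,t-1-s) \le q \le t-1} d^{k,\ell}_{0,q}\bigr) \Bigr) \\
 = 1 + \min\Bigl( & t - 1 + \min_{\max(0,s-t) \le q \le s} d^{k,\ell}_{q,0}, \;\; s - 1 + \min_{\max(0,t-s) \le q \le t} d^{k,\ell}_{0,q} \Bigr),
\end{align*}

where we have used the property that consecutive cells in the $(d_{ij})$ matrix differ
at most by 1. Note that we have assumed $s > 1$ and $t > 1$. The particular
cases $s = 1$ or $t = 1$ are easily derived as well, for example for $s = 1$ and $t > 1$
we have

\begin{align*}
d^{k,\ell}_{1,t} &= 1 + \min(d^{k,\ell}_{0,t-1}, \; d^{k,\ell}_{0,t}, \; d^{k,\ell}_{1,t-1}) \\
 &= 1 + \min\Bigl(d^{k,\ell}_{0,t-1}, \; d^{k,\ell}_{0,t}, \;
    1 + \min\bigl(t - 2 + \min_{\max(0,2-t) \le q \le 1} d^{k,\ell}_{q,0}, \; \min_{\max(0,t-2) \le q \le t-1} d^{k,\ell}_{0,q}\bigr)\Bigr) \\
 &= 1 + \min\bigl(d^{k,\ell}_{0,t-1}, \; d^{k,\ell}_{0,t}, \; t - 1 + \min(d^{k,\ell}_{0,0}, d^{k,\ell}_{1,0}), \; 1 + \min(d^{k,\ell}_{0,t-2}, d^{k,\ell}_{0,t-1})\bigr) \\
 &= 1 + \min\bigl(t - 1 + \min(d^{k,\ell}_{0,0}, d^{k,\ell}_{1,0}), \; \min(d^{k,\ell}_{0,t-1}, d^{k,\ell}_{0,t})\bigr),
\end{align*}

which is the particularization of formula (5) for $s = 1$.

□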
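Both lemmas can be sanity-checked numerically against the full matrix of recurrence (1). The Python sketch below is a brute-force test, not the algorithm of this section: it builds the whole matrix and then compares every interior cell of every box with the value predicted from that box's top and left borders. All function names are illustrative.

```python
from itertools import groupby


def dp_matrix(a, b):
    """Full (m+1) x (n+1) Levenshtein matrix of recurrence (1)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j - 1] + (a[i - 1] != b[j - 1]),
                          d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d


def runs(s):
    """(letter, 0-based start offset, length) for each maximal run of s."""
    out, pos = [], 0
    for letter, group in groupby(s):
        length = len(list(group))
        out.append((letter, pos, length))
        pos += length
    return out


def lemmas_hold(a, b):
    """Check formula (4) in equal letter boxes and formula (5) in the others."""
    d = dp_matrix(a, b)
    for letter_a, i0, p in runs(a):
        for letter_b, j0, r in runs(b):
            def box(s, t):
                return d[i0 + s][j0 + t]   # cell (s, t) of the current box
            for s in range(1, p + 1):
                for t in range(1, r + 1):
                    if letter_a == letter_b:
                        expected = box(0, t - s) if s <= t else box(s - t, 0)
                    else:
                        left = min(box(q, 0) for q in range(max(0, s - t), s + 1))
                        top = min(box(0, q) for q in range(max(0, t - s), t + 1))
                        expected = 1 + min(t - 1 + left, s - 1 + top)
                    if box(s, t) != expected:
                        return False
    return True


print(lemmas_hold("aaabbbbccaab", "aabbbcccab"))
```

The check takes the border values from the already computed matrix, so it confirms the formulas but gives no speed-up by itself.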
Formula (5) relates the values at the right and bottom borders of a box to
its left and top borders. Yet it is not enough to compute the cells in constant
time. Although we cannot compute one cell in $O(1)$ time, we can compute all
the $p_k$ (or $r_\ell$) cells in overall $O(p_k)$ (or $O(r_\ell)$) time.

Fig. 2 shows the algorithm. We use a data structure (which in the
pseudocode is represented just as a set $M_*$) able to handle a multiset of elements
starting with a single element, adding and deleting elements, and delivering its
minimum value at any time. It will be used to maintain and update the minima
$\min_{\max(0,s-t) \le q \le s} d^{k,\ell}_{q,0}$ and $\min_{\max(0,t-s) \le q \le t} d^{k,\ell}_{0,q}$ used in the formula (5). We
see later that in our particular application all those operations can be performed
in constant time.

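In the code we use $dr^{k,\ell}_s = d^{k,\ell}_{s,r_\ell}$ for the rightmost column and $db^{k,\ell}_t = d^{k,\ell}_{p_k,t}$
for the bottom row. Their update formulas are derived from the formula (5):

\begin{align*}
dr^{k,\ell}_s &= 1 + \min\Bigl( r_\ell - 1 + \min_{\max(0,s-r_\ell) \le q \le s} dr^{k,\ell-1}_q, \;\;
                 s - 1 + \min_{\max(0,r_\ell-s) \le q \le r_\ell} db^{k-1,\ell}_q \Bigr), \\
db^{k,\ell}_t &= 1 + \min\Bigl( t - 1 + \min_{\max(0,p_k-t) \le q \le p_k} dr^{k,\ell-1}_q, \;\;
                 p_k - 1 + \min_{\max(0,t-p_k) \le q \le t} db^{k-1,\ell}_q \Bigr).
\end{align*}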

The whole algorithm can be made $O(n + m)$ space by noting that in a
column-wise traversal we need, when computing box $(k, \ell)$, to store only $dr^{k,\ell-1}$ and
$db^{k-1,\ell}$, so the space is that for storing one complete column ($m$) and a row whose
width is one box (at most $n$). Our multiset data structure does not increase this
space complexity. Hence we have

Theorem 3 Given strings $A$ and $B$ of lengths $m$ and $n$ that are run-length
encoded to lengths $m'$ and $n'$, there is an algorithm to calculate $D_L(A, B)$ in
$O(m'n + n'm)$ time and $O(m + n)$ space in the worst case.

□



3.2 The Multiset Data Structure 

What is left is to describe our data structure to handle a multiset of natural
numbers. We exploit the fact that consecutive cells in $(d_{ij})$ differ by at most 1.
Our data structure represents the multiset $S$ as a triple $(\min(S), \max(S),
V_{\min(S) \ldots \max(S)} \to \mathbb{N})$. That is, we store the minimum and maximum value of the
multiset and a vector of counters $V$, which stores at $V_i$ the number of elements
equal to $i$ in $S$. Given the property that consecutive cells differ by at most 1, we
have that no value $V_i$ is equal to zero. This is proved in the following lemma.

Lemma 4 No value $V_i$ for $\min(S) \le i \le \max(S)$ is equal to zero when $S$ is a
set of consecutive values in $(d_{ij})$ (i.e., $S$ contains a contiguous part of a row or
a column of the matrix $(d_{ij})$).

Proof. The lemma is trivially true for the extremes $i = \min(S)$ and $i = \max(S)$.
Let us now suppose that $V_i = 0$ for an intermediate value. Let us assume that
Levenshtein (A' = (a_{i_1}, p_1)(a_{i_2}, p_2) ... (a_{i_{m'}}, p_{m'}), B' = (b_{j_1}, r_1)(b_{j_2}, r_2) ... (b_{j_{n'}}, r_{n'}))

1.   /* We fill the topmost row and leftmost column first */
2.   dr^{0,0}_0 ← 0, db^{0,0}_0 ← 0
3.   For k ∈ 1 ... m' Do
4.       For s ∈ 0 ... p_k Do dr^{k,0}_s ← dr^{k-1,0}_{p_{k-1}} + s
5.       db^{k,0}_0 ← dr^{k,0}_{p_k}
6.   For ℓ ∈ 1 ... n' Do
7.       For t ∈ 0 ... r_ℓ Do db^{0,ℓ}_t ← db^{0,ℓ-1}_{r_{ℓ-1}} + t
8.       dr^{0,ℓ}_0 ← db^{0,ℓ}_{r_ℓ}
9.   /* Now we fill the rest of the matrix */
10.  For ℓ ∈ 1 ... n' Do /* column-wise traversal */
11.      For k ∈ 1 ... m' Do
12.          If a_{i_k} = b_{j_ℓ} Then /* equal letter box */
13.              For s ∈ 1 ... p_k Do
14.                  If s ≤ r_ℓ Then dr^{k,ℓ}_s ← db^{k-1,ℓ}_{r_ℓ-s} Else dr^{k,ℓ}_s ← dr^{k,ℓ-1}_{s-r_ℓ}
15.              For t ∈ 1 ... r_ℓ Do
16.                  If p_k ≤ t Then db^{k,ℓ}_t ← db^{k-1,ℓ}_{t-p_k} Else db^{k,ℓ}_t ← dr^{k,ℓ-1}_{p_k-t}
17.          Else /* different letter box */
18.              M_r ← {dr^{k,ℓ-1}_0}, M_b ← {db^{k-1,ℓ}_{r_ℓ}}
19.              dr^{k,ℓ}_0 ← dr^{k-1,ℓ}_{p_{k-1}}
20.              For s ∈ 1 ... p_k Do
21.                  M_r ← M_r ∪ {dr^{k,ℓ-1}_s}
22.                  If s > r_ℓ Then M_r ← M_r − {dr^{k,ℓ-1}_{s-r_ℓ-1}}
23.                  If r_ℓ ≥ s Then M_b ← M_b ∪ {db^{k-1,ℓ}_{r_ℓ-s}}
24.                  dr^{k,ℓ}_s ← 1 + min(r_ℓ − 1 + min(M_r), s − 1 + min(M_b))
25.              M_r ← {dr^{k,ℓ-1}_{p_k}}, M_b ← {db^{k-1,ℓ}_0}
26.              db^{k,ℓ}_0 ← db^{k,ℓ-1}_{r_{ℓ-1}}
27.              For t ∈ 1 ... r_ℓ Do
28.                  If p_k ≥ t Then M_r ← M_r ∪ {dr^{k,ℓ-1}_{p_k-t}}
29.                  M_b ← M_b ∪ {db^{k-1,ℓ}_t}
30.                  If t > p_k Then M_b ← M_b − {db^{k-1,ℓ}_{t-p_k-1}}
31.                  db^{k,ℓ}_t ← 1 + min(t − 1 + min(M_r), p_k − 1 + min(M_b))
32.  Return dr^{m',n'}_{p_{m'}} /* or db^{m',n'}_{r_{n'}} */

Fig. 2. The O(m'n + n'm) time algorithm to compute the Levenshtein distance
between A and B, coded as a run-length sequence of pairs (letter, run_length).
the value $\min(S)$ is achieved at cell $d_{i,j}$ and that the value $\max(S)$ is achieved at
cell $d_{i',j'}$. Since all the intermediate cell values are also in $S$ by hypothesis, and
consecutive cells differ by at most 1, it follows that any value between $\min(S)$
and $\max(S)$ exists in a path that goes from $d_{i,j}$ to $d_{i',j'}$. □

Fig. 3 shows the detailed algorithms. When we initialize the data structure
with the single element $S = \{x\}$ we represent the situation as $(x, x, V_x = 1)$.
When we have to add an element $y$ to $S$, we check whether $y$ is outside the
range $\min(S) \ldots \max(S)$, and in that case we extend the range. In any case
we increment $V_y$. Note that the domain extension is never by more than one
cell, as empty cells cannot appear in between by Lemma 4. When we have
to remove an element $z$ from $S$ we simply decrement $V_z$. If $V_z$ becomes zero,
Lemma 4 implies that this is because $z$ is either the minimum or the maximum
of the set. So we reduce the domain of $V$ by one. Finally, the operation $\min(S)$
is trivial as we have it already precomputed.



Create (x)
1.   Return (x, x, V_x = 1)

Add ((minS, maxS, V), y)
2.   If y < minS Then
3.       minS ← y
4.       add new first cell V_y = 0
5.   Else If y > maxS Then
6.       maxS ← y
7.       add new last cell V_y = 0
8.   V_y ← V_y + 1
9.   Return (minS, maxS, V)

Remove ((minS, maxS, V), z)
10.  V_z ← V_z − 1
11.  If V_z = 0 Then
12.      If z = minS Then
13.          remove first cell from V
14.          minS ← minS + 1
15.      Else /* z = maxS */
16.          remove last cell from V
17.          maxS ← maxS − 1
18.  Return (minS, maxS, V)

Min ((minS, maxS, V))
19.  Return minS

Fig. 3. The multiset data structure implementation.
It is easily seen that all the operations take constant time. As a practical
matter, we note that it is a good idea to keep $V$ in a circular array so that it
can grow and shrink at either end. Its maximum size corresponds to $p_k$ (for
$M_r$) or $r_\ell$ (for $M_b$), which are known at the time of Create.
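The operations of Fig. 3 can be sketched in Python as follows. This is an illustration, not the authors' implementation: the class name `RangeMultiset` is hypothetical, and a `collections.deque` stands in for the circular array. Operations at the ends of a deque are constant time, but indexing into its middle is not, so a preallocated circular array would be needed to meet the constant-time bound for every operation.

```python
from collections import deque


class RangeMultiset:
    """Multiset of integers whose distinct values always form a gap-free range."""

    def __init__(self, x):
        # Create(x): a single element, so min = max = x and V_x = 1.
        self.lo = self.hi = x
        self.counts = deque([1])        # counts[v - lo] is the multiplicity V_v

    def add(self, y):
        # By Lemma 4 the range can only be extended by one cell at a time.
        assert self.lo - 1 <= y <= self.hi + 1
        if y < self.lo:
            self.counts.appendleft(0)
            self.lo = y
        elif y > self.hi:
            self.counts.append(0)
            self.hi = y
        self.counts[y - self.lo] += 1

    def remove(self, z):
        self.counts[z - self.lo] -= 1
        if self.counts[z - self.lo] == 0:
            # A counter can only reach zero at one end of the range.
            assert z in (self.lo, self.hi)
            if z == self.lo:
                self.counts.popleft()
                self.lo += 1
            else:
                self.counts.pop()
                self.hi -= 1

    def min(self):
        return self.lo


ms = RangeMultiset(3)
for value in (4, 3, 2):
    ms.add(value)
print(ms.min())   # 2
ms.remove(2)
print(ms.min())   # 3
```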

4 Approximate Searching 

Let us now consider a problem related to computing the LCS or the Levenshtein 
distance. Assume that string A is a short pattern and string B is a long text 
(so m is much smaller than n), and that we are given a threshold parameter 
k. We are interested in reporting all the “approximate occurrences” of A in B, 
that is, all the positions of text substrings which are at distance k or less from 
the pattern A. In order to ensure a linear size output, we content ourselves with 
reporting the ending positions of the occurrences (which we call “matches”). 

The classical algorithm to find all the matches computes a matrix exactly
like those of recurrences (1) and (2), with the only difference that $d_{0,j} = 0$. This
permits the occurrences to start at any text position. The last row of the matrix
$d_{m,j}$ is examined and every text position $j$ such that $d_{m,j} \le k$ is reported as a
match.
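The classical uncompressed search can be sketched in Python as follows, here for the Levenshtein model. It processes the text column by column, keeps only the previous column, and reports 1-based ending positions; the function name `approximate_matches` is illustrative.

```python
def approximate_matches(pattern, text, k):
    """Ending positions j (1-based) with d_{m,j} <= k, using d_{0,j} = 0."""
    m = len(pattern)
    column = list(range(m + 1))          # column 0: d_{i,0} = i
    ends = []
    for j, letter in enumerate(text, start=1):
        new = [0] * (m + 1)              # d_{0,j} = 0: a match may start anywhere
        for i in range(1, m + 1):
            new[i] = min(column[i - 1] + (pattern[i - 1] != letter),
                         column[i] + 1,
                         new[i - 1] + 1)
        column = new
        if column[m] <= k:
            ends.append(j)
    return ends


print(approximate_matches("abc", "xxabcxx", 0))   # [5]
print(approximate_matches("abc", "xxabcxx", 1))   # [4, 5, 6]
```

This takes $O(mn)$ time on the uncompressed text, which is the baseline the compressed algorithms of this section aim to improve on.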

Our goal now is to devise a more efficient algorithm when pattern and text are
run-length compressed. A trivial $O(m^2 n' + R)$ algorithm (where $R$ is the size of
the output) is obtained as follows. We start filling the matrix only at beginnings
of text runs, and complete the first $2m$ columns only (at $O(m^2)$ cost). The rest
of the columns of the run are equal to the $2m$-th because no optimal path can
be longer than $2m - 1$ under the LCS or Levenshtein models. We later examine
the last row of the matrix and report every text position with value $\le k$. If the
run is longer than $2m$, then we have not produced the whole last row but only
the first $2m$ cells of it. In this case we report the positions $2m + 1 \ldots r_\ell$ of the
$\ell$-th run if and only if the position $2m$ was reported.

We now improve the trivial algorithm. A first attempt is to apply our
algorithms directly using the new base value $d_{0,j} = 0$. This change does not present
complications.

Let us first concentrate on the Levenshtein distance. Our algorithm obtains
$O(m'n + n'm)$ time, which may or may not be better than the trivial approach.
The problem is that $O(m'n)$ may be too much in comparison to $O(m^2 n')$,
especially if $n$ is much larger than $m$. We seek an algorithm proportional to
the compressed text size. We divide the text runs into short (of length at most
$2m$) and long (longer than $2m$) runs. We apply our Levenshtein algorithm on
the text runs, filling the matrix column-wise. If we have a short run $(b_{j_\ell}, r_\ell)$,
$r_\ell \le 2m$, we compute all the $m' + 1$ horizontal borders plus its final vertical
border (which becomes the initial border of the next column). The time to achieve
this is $O(m' r_\ell + m)$. For an additional $O(r_\ell)$ cost we examine all the cells of the
last row and report all the text positions $j_\ell + t$ such that $db^{m',\ell}_t \le k$.

If we have a long run $(b_{j_\ell}, r_\ell)$, $r_\ell > 2m$, we limit its
length to $2m$ and apply the same algorithm, at $O(m'm + m + m)$ cost. The
columns $2m + 1 \ldots r_\ell$ of that run are equal to the $2m$-th, so we just
need to examine the last row of the $2m$-th column, and report all the text
positions up to the end of the run (positions $2m + 1 \ldots r_\ell$ of the
run) if and only if the $2m$-th position is a match.

This algorithm is $O(n'm'm + R)$ time in the worst case, where $R$ is the
number of occurrences reported. For the LCS model we have the same upper
bound, so we achieve the same complexity. Our $O(m'n'(m' + n'))$ algorithm does
not yield a good complexity here. The space is that needed to compute one text
run limited to length $2m$, i.e. $O(m'm)$.

Note that if we are allowed to represent the occurrences as a sequence of runs 
of consecutive text positions (all of which match), then the R extra term of the 
search cost disappears. 

Theorem 5 Given a pattern $A$ and a text $B$ of lengths $m$ and $n$ that are
run-length encoded to lengths $m'$ and $n'$, there is an algorithm to find all
the ending points of the approximate occurrences of $A$ in $B$, either under
the LCS or Levenshtein model, in $O(m'mn')$ time and $O(m'm)$ space in the
worst case.

□ 



5 Improving a Greedy Algorithm for the LCS 

The idea in our algorithm for the Levenshtein distance in Sect. 3 was to fill
all the borders of all the boxes $(d^{k,\ell})$. The natural way to reduce the
complexity would be to fill only the corners of the boxes (see the figure). For
the $D_L$ distance this seems difficult to obtain, but for the $D_{ID}$
distance there is an obvious greedy algorithm that achieves this goal; in
different letter boxes, we can calculate the corner values in constant time,
and in equal letter boxes we can trace an optimal path to a corner in
$O(m' + n')$ time. Thus, we can calculate all the corner values in
$O(m'n'(m' + n'))$ time.¹

It turns out that we can improve the greedy algorithm significantly by fairly
simple means. We notice that the diagonal method can be applied (see Sect. 5.2),
and achieve an $O(d^2 \min(n', m'))$ algorithm. We also give other improvements
that do not affect the worst case, but are significant in the average case and
in practice. We end the section conjecturing that our improved algorithm runs
in $O(m'n')$ time on average. As we are unable to prove this conjecture, we
provide experimental evidence to support it.

5.1 Greedy Algorithm for the LCS 

Calculating the corner value $d^{k,\ell}_{p_k,r_\ell}$ in a different letter
box is easy, because it can be retrieved from the values
$d^{k,\ell}_{0,r_\ell} = d^{k-1,\ell}_{p_{k-1},r_\ell}$ and
$d^{k,\ell}_{p_k,0} = d^{k,\ell-1}_{p_k,r_{\ell-1}}$, which are calculated
earlier during the dynamic programming. This follows from Lemma 6 below.

¹ Apostolico et al. also gave a basic $O(m'n'(m' + n'))$ algorithm for the LCS,
which they then improved to $O(m'n' \log(m'n'))$. Their basic algorithm differs
from our greedy algorithm in that they were using the recurrence for
calculating the LCS directly, and we are calculating the distance $D_{ID}$.
Also, they traced a specific optimal path (which was the property that they
could use to achieve the $O(m'n' \log(m'n'))$ algorithm).



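Lemma 6 (Bunke and Csirik) The recurrence for $D_{ID}$ can be replaced by the
recurrence

$$ d^{k,\ell}_{s,t} = \min\left(d^{k,\ell}_{s,0} + t,\; d^{k,\ell}_{0,t} + s\right), \qquad (6) $$

where $1 \le s \le p_k$ and $1 \le t \le r_\ell$, for values
$d^{k,\ell}_{s,t}$ in a different letter box.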

□ 
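
Lemma 6 can be checked numerically on small inputs. The sketch below is only
an illustration, not part of the algorithm: it fills the full $D_{ID}$ matrix
with the plain dynamic programming and then tests the two-term minimum on
every cell of every different letter box. The helper names (`indel_matrix`,
`run_bounds`, `lemma6_holds`) are ours.

```python
from itertools import groupby


def indel_matrix(a, b):
    """Full D_ID matrix: insertions and deletions only, each of cost 1."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = 1 + min(d[i - 1][j], d[i][j - 1])
    return d


def run_bounds(s):
    """List of (letter, start, end) per run; start inclusive, end exclusive."""
    bounds, pos = [], 0
    for ch, group in groupby(s):
        length = len(list(group))
        bounds.append((ch, pos, pos + length))
        pos += length
    return bounds


def lemma6_holds(a, b):
    """Check d[s][t] = min(d[s][0] + t, d[0][t] + s) in all different letter boxes."""
    d = indel_matrix(a, b)
    for ca, i0, i1 in run_bounds(a):
        for cb, j0, j1 in run_bounds(b):
            if ca == cb:
                continue          # equal letter box: the lemma does not apply
            for s in range(1, i1 - i0 + 1):
                for t in range(1, j1 - j0 + 1):
                    expected = min(d[i0 + s][j0] + t, d[i0][j0 + t] + s)
                    if d[i0 + s][j0 + t] != expected:
                        return False
    return True
```

Row `i0` and column `j0` of the matrix play the role of the box borders
$d^{k,\ell}_{0,t}$ and $d^{k,\ell}_{s,0}$, since consecutive boxes overlap on
their borders.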



In contrast to the $D_L$ distance, the difficult part in the $D_{ID}$ distance
lies in equal letter boxes. As noted earlier, the lemma on equal letter boxes
applies also for the $D_{ID}$ distance. From that lemma we can see that the
corner values are retrieved along the diagonal, and those values may not have
been calculated earlier. However, if $p_k = r_\ell$ in all equal letter boxes,
then each corner can be calculated in constant time.

This gives an $O(m'n')$ algorithm for a special case, as previously noted in
the literature.

What follows is an algorithm to retrieve the value $d^{k,\ell}_{p_k,r_\ell}$ in
an equal letter box in $O(m' + n')$ time. The idea is to trace an optimal path
to the cell $d^{k,\ell}_{p_k,r_\ell}$. This can be done by using the equal
letter box lemma and Lemma 6 recursively. Assume that
$d^{k,\ell}_{p_k,r_\ell} = d^{k,\ell}_{0,r_\ell-p_k}$ by the equal letter box
lemma (the case $d^{k,\ell}_{p_k,r_\ell} = d^{k,\ell}_{p_k-r_\ell,0}$ is
symmetric). If $k = 1$, then the value $d^{1,\ell}_{0,r_\ell-p_k}$ corresponds
to a value in the first row (0) of the matrix $(d_{ij})$, which is known.
Otherwise, the box $(d^{k-1,\ell})$ is a different letter box, and using the
definition of overlapping boxes and Lemma 6 it holds

$$ d^{k,\ell}_{0,r_\ell-p_k} = d^{k-1,\ell}_{p_{k-1},r_\ell-p_k}
   = \min\left(d^{k-1,\ell}_{p_{k-1},0} + r_\ell - p_k,\;
               d^{k-1,\ell}_{0,r_\ell-p_k} + p_{k-1}\right). $$

Now, the value $d^{k-1,\ell}_{p_{k-1},0}$ is calculated during the dynamic
programming, so we can continue on tracing the value
$d^{k-1,\ell}_{0,r_\ell-p_k}$ using the equal letter box lemma and Lemma 6
recursively until we meet a value that has already been calculated during
dynamic programming (including the first row and the first column of the matrix
$(d_{ij})$). The recursion never branches, because the equal letter box lemma
defines explicitly the next value to trace, and one of the two values (from
which the minimum is taken in Lemma 6) is always known (that is because we
enter the different letter boxes at the borders, and therefore the other value
is from a corner that is calculated during the dynamic programming). We call
the path described by the recursion a tracing path.

Tracing the value $d^{k,\ell}_{p_k,r_\ell}$ in an equal letter box may take
$O(m' + n')$ time, because we are skipping one box at a time, and there are at
most $m' + n'$ boxes in the tracing path. Therefore, we get an
$O(m'n'(m' + n'))$ algorithm to calculate $D_{ID}(A, B)$. A worst case example
that actually achieves the bound is $A = a^n$ and $B = (ab)^{n/2}$.

The space requirement of the algorithm is $O(m'n')$, because we need to store
only the corner value in each box, and the $O(m' + n')$ space for the stack is
not needed, because the recursion does not branch.

We also achieve the $O(m'n + n'm)$ bound, because the corner values
$d^{k,\ell}_{p_k,r_\ell}$ of equal letter boxes define distinct tracing paths,
and therefore each cell in the borders of the boxes can be visited only once.
To see this, observe that each border cell reached by a tracing path uniquely
determines the border cell it comes from along the tracing path, and therefore
no two different paths can meet in a border cell. The only exception is a
corner cell, but in this case all the tracing paths end there immediately.



Theorem 7 Given strings $A$ and $B$ of lengths $m$ and $n$ that are run-length
encoded to lengths $m'$ and $n'$, there is an algorithm to calculate
$D_{ID}(A, B)$ in $O(\min\{m'n'(m' + n'),\, m'n + n'm\})$ time and $O(m'n')$
space. □
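
The two box rules can be combined into a compact program. The Python sketch
below is not the greedy tracing procedure described above; it is a simpler
memoized recursion, written under our own naming, that applies Lemma 6 in
different letter boxes and the diagonal copy rule in equal letter boxes. Every
recursive call lands on a box border, so only border cells are ever evaluated.
A plain dynamic programming routine is included as a reference for comparison.

```python
from bisect import bisect_left
from functools import lru_cache
from itertools import groupby


def rle(s):
    """Run-length encode a string into (letter, run_length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def indel_distance(a, b):
    """Reference D_ID on plain strings, O(mn) time."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] if x == y else 1 + min(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def indel_distance_rle(a_runs, b_runs):
    """D_ID from run-length encodings, evaluating only box-border cells."""
    ends_a = [0]                      # ends_a[k] = last row of the k-th run
    for _, p in a_runs:
        ends_a.append(ends_a[-1] + p)
    ends_b = [0]                      # ends_b[l] = last column of the l-th run
    for _, r in b_runs:
        ends_b.append(ends_b[-1] + r)

    @lru_cache(maxsize=None)
    def d(i, j):
        if i == 0:
            return j                  # first row of the matrix
        if j == 0:
            return i                  # first column of the matrix
        k = bisect_left(ends_a, i)    # run of A containing row i
        l = bisect_left(ends_b, j)    # run of B containing column j
        top, left = ends_a[k - 1], ends_b[l - 1]
        if a_runs[k - 1][0] == b_runs[l - 1][0]:
            q = min(i - top, j - left)            # follow the diagonal to a border
            return d(i - q, j - q)
        return min(d(top, j) + (i - top),         # Lemma 6: come from the top border
                   d(i, left) + (j - left))       # or from the left border

    return d(ends_a[-1], ends_b[-1])
```

For instance, `indel_distance_rle(rle("aaaa"), rle("abab"))` evaluates the
worst case pair $A = a^n$, $B = (ab)^{n/2}$ for $n = 4$. Unlike the greedy
algorithm, this sketch caches border cells, so its space usage is not
$O(m'n')$.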



5.2 Diagonal Algorithm 



The diagonal method provides an $O(d \min(m, n))$ algorithm for calculating
the distance $d = D_{ID}(A, B)$ (or $D_L$ as well) between strings $A$ and $B$
of length $m$ and $n$, respectively. The idea is the following: the value
$d_{mn} = D_{ID}(A, B)$ in the $(d_{ij})$ matrix defines a diagonal band, where
the optimal path must lie. Thus, if we want to check whether $D_{ID} \le k$, we
can limit the calculation to the diagonal band defined by value $k$ (consisting
of $O(k)$ diagonals). Starting with $k = |n - m| + 1$, we can double the value
$k$ and run in each step the recurrence on the increasing diagonal band. As
soon as $d_{mn} \le k$, we have found $D_{ID}(A, B) = d_{mn}$, and we can stop
the doubling. The total number of diagonals evaluated is at most
$2 D_{ID}(A, B)$, and there are at most $\min(m, n)$ cells in each diagonal.
Therefore, the total cost of the algorithm is $O(d \min(m, n))$, where
$d = D_{ID}(A, B)$.
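
The doubling scheme can be sketched on plain (uncompressed) strings as
follows. This is an illustrative Python sketch under our own naming
(`indel_within`, `indel_distance_doubling`), assuming the band for value $k$
consists of the cells with $|i - j| \le k$; it is not the run-length variant
discussed next.

```python
def indel_within(a, b, k):
    """Return D_ID(a, b) if it is at most k, else None.

    Only cells with |i - j| <= k are filled; cells outside the band
    are treated as infinitely expensive.
    """
    m, n = len(a), len(b)
    if abs(n - m) > k:
        return None                       # cell (m, n) lies outside the band
    inf = m + n + 1
    prev = {j: j for j in range(0, min(n, k) + 1)}      # row 0 inside the band
    for i in range(1, m + 1):
        cur = {}
        for j in range(max(0, i - k), min(n, i + k) + 1):
            if j == 0:
                cur[j] = i
            elif a[i - 1] == b[j - 1]:
                cur[j] = prev.get(j - 1, inf)
            else:
                cur[j] = 1 + min(prev.get(j, inf), cur.get(j - 1, inf))
        prev = cur
    d = prev.get(n, inf)
    return d if d <= k else None


def indel_distance_doubling(a, b):
    """Double k until the banded computation succeeds."""
    k = abs(len(a) - len(b)) + 1
    while True:
        d = indel_within(a, b, k)
        if d is not None:
            return d
        k *= 2
```

The banded values never underestimate the true distance, and they are exact
whenever the true distance is at most $k$, which is why the first successful
round already returns $D_{ID}(A, B)$.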

We can use the diagonal method with our greedy algorithm as follows: we
calculate only the corner values that are inside the diagonal band defined by
value $k$ in the above doubling algorithm. The corner values in equal letter
boxes inside the diagonal band can be retrieved in $O(k)$ time. That is because
we can limit the length of the tracing paths to the value $2k + 1$ (between two
equal letter boxes there is a different letter box that contributes at least 1
to the value that we are tracing, and we are not interested in corner values
that are greater than $k$). Therefore, we get the total cost
$O(d^2 \min(m', n'))$, where $d = D_{ID}(A, B)$.



5.3 Faster on Average 



There are some practical refinements for the greedy algorithm that do not im- 
prove its worst case behavior, but do have an impact on its average case. 

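First of all, the runs of different letter boxes can be skipped in the tracing
paths.
Consider two consecutive different letter boxes $(d^{k,\ell})$ and
$(d^{k+1,\ell})$. By Lemma 6 it holds for the values
$d^{k+1,\ell}_{p_{k+1},t}$, $1 \le t \le r_\ell$:

$$
\begin{aligned}
d^{k+1,\ell}_{p_{k+1},t}
 &= \min\left(d^{k+1,\ell}_{0,t} + p_{k+1},\; d^{k+1,\ell}_{p_{k+1},0} + t\right) \\
 &= \min\left(d^{k,\ell}_{p_k,t} + p_{k+1},\; d^{k+1,\ell}_{p_{k+1},0} + t\right) \\
 &= \min\left(d^{k,\ell}_{0,t} + p_k + p_{k+1},\; d^{k,\ell}_{p_k,0} + p_{k+1} + t,\; d^{k+1,\ell}_{p_{k+1},0} + t\right) \\
 &= \min\left(d^{k,\ell}_{0,t} + p_k + p_{k+1},\; d^{k+1,\ell}_{p_{k+1},0} + t\right).
\end{aligned}
$$

The last step holds because consecutive cells differ by at most 1, so
$d^{k+1,\ell}_{p_{k+1},0} \le d^{k,\ell}_{p_k,0} + p_{k+1}$.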

The above result can be extended to the following lemma by using induction: 

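Lemma 8 Let $((d^{k',\ell}), (d^{k'+1,\ell}), \ldots, (d^{k,\ell}))$ and
$((d^{k,\ell'}), (d^{k,\ell'+1}), \ldots, (d^{k,\ell}))$ be vertical and
horizontal runs of different letter boxes. When $1 \le t \le r_\ell$ and
$1 \le s \le p_k$, the recurrence (4) can be replaced by the recurrences

$$ d^{k,\ell}_{p_k,t} = \min\left(d^{k,\ell}_{p_k,0} + t,\;
   d^{k',\ell}_{0,t} + \sum_{q=k'}^{k} p_q\right), \qquad 1 \le t \le r_\ell, $$

$$ d^{k,\ell}_{s,r_\ell} = \min\left(d^{k,\ell}_{0,r_\ell} + s,\;
   d^{k,\ell'}_{s,0} + \sum_{q=\ell'}^{\ell} r_q\right), \qquad 1 \le s \le p_k. $$

□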

Now it is obvious how to speed up the retrieval of the values
$d^{k,\ell}_{p_k,r_\ell}$ in the equal letter boxes. During dynamic
programming, we can maintain pointers in each different letter box to the last
equal letter box encountered in the direction of the row and the column. When
we enter a different letter box while tracing the value of
$d^{k,\ell}_{p_k,r_\ell}$ in an equal letter box, we can use Lemma 8 to
calculate the minimum over the run of different letter boxes at once, and
continue on tracing from the equal letter box preceding the run of different
letter boxes. (Note that in order to use the summations of Lemma 8 we had
better store the cumulative $i_k$ and $j_\ell$ values instead of $p_k$ and
$r_\ell$.) Therefore we get the following result:

Theorem 9 Given strings $A$ and $B$ of lengths $m$ and $n$ that are run-length
encoded to lengths $m'$ and $n'$, such that all the runs of different letters
over an alphabet of size $|\Sigma|$ are equally likely and in random order,
there is an algorithm to calculate $D_{ID}(A, B)$ in
$O(m'n'(1 + (m' + n')/|\Sigma|^2))$ time on average.

Proof. (Sketch) The first part of the cost, $O(m'n')$, comes from the constant
time computation of all the different letter boxes. On the other hand, there
are on average $O(m'n'/|\Sigma|)$ equal letter boxes. Between two runs of a
letter $a \in \Sigma$, there are on average $|\Sigma| - 1$ runs of other
letters. This holds both for strings $A$ and $B$. In other words, the expected
length of a run of different letter boxes is $|\Sigma| - 1$. Therefore the
retrieval of the value $d^{k,\ell}_{p_k,r_\ell}$ in an equal letter box takes
time at most $O((m' + n')/|\Sigma|)$ on average. □

The second improvement to the greedy algorithm is to limit the length of
the tracing paths. In the greedy algorithm the tracing is continued until a
value is reached that has been calculated during the dynamic programming.
However, there are more known values than those that have been explicitly
calculated. Consider value $d^{k,\ell}_{p_k,t}$, $1 \le t \le r_\ell$ (or
symmetrically $d^{k,\ell}_{s,r_\ell}$, $1 \le s \le p_k$) in the border of a
different letter box. If $d^{k,\ell}_{p_k,r_\ell} = d^{k,\ell}_{p_k,0} + r_\ell$
then it must hold $d^{k,\ell}_{p_k,t} = d^{k,\ell}_{p_k,0} + t$, otherwise we
get a contradiction: $d^{k,\ell}_{p_k,r_\ell} < d^{k,\ell}_{p_k,0} + r_\ell$.

We call the above situation a horizontal (vertical) bridge. Note that from
Lemma 6 it follows that there is either a vertical or a horizontal bridge in
each different letter box. When we enter a different letter box in the
recursion, we can check whether the bridge property holds at the border we
entered, using the corner values that are calculated during the dynamic
programming. Thus, we can stop the recursion at the first bridge encountered.
To combine this improvement with the algorithm that skips runs of different
letter boxes, we need Lemma 10 below, which states that the bridges propagate
along runs of different letter boxes. Therefore we only need to check whether
the last different letter box has a bridge to decide whether we have to skip to
the next equal letter box. The resulting algorithm is given in pseudo-code in
Fig. 4.

Lemma 10 Let $((d^{k',\ell}), \ldots, (d^{k,\ell}))$ be a vertical run of
different letter boxes. If there is a horizontal bridge
$d^{k',\ell}_{p_{k'},r_\ell} = d^{k',\ell}_{p_{k'},0} + r_\ell$ then there is a
horizontal bridge
$d^{k'',\ell}_{p_{k''},r_\ell} = d^{k'',\ell}_{p_{k''},0} + r_\ell$ for all
$k' \le k'' \le k$. The symmetric result holds for horizontal runs of different
letter boxes.

Proof. We use the counter-argument that
$d^{k'',\ell}_{p_{k''},r_\ell} = d^{k'',\ell}_{p_{k''},0} + r_\ell$ does not
hold for some $k' < k'' \le k$. Then by Lemma 8 and by the bridge assumption it
holds

$$ d^{k'',\ell}_{p_{k''},r_\ell}
   = d^{k'+1,\ell}_{0,r_\ell} + \sum_{s=k'+1}^{k''} p_s
   = d^{k'+1,\ell}_{0,0} + r_\ell + \sum_{s=k'+1}^{k''} p_s. $$

On the other hand, using the counter-argument and the fact that consecutive
cells in the $(d_{ij})$ matrix differ at most by 1, we get

$$ d^{k'',\ell}_{p_{k''},r_\ell} < d^{k'',\ell}_{p_{k''},0} + r_\ell
   \le d^{k'+1,\ell}_{0,0} + \sum_{s=k'+1}^{k''} p_s + r_\ell, $$

which is a contradiction, and so the original proposition holds. □

Lemma 10 has a corollary: if the last different letter box in a run does not
have a horizontal (vertical) bridge, then none of the boxes in the same run
have a horizontal (vertical) bridge and, on the other hand, all the boxes in
the same run must have a vertical (horizontal) bridge.

Now, if two tracing paths cross inside a box (or run thereof), then one of
them necessarily meets a bridge. In the average case, there are a lot of
crossings of the tracing paths and the total cost for tracing the values in
equal letter boxes decreases.

LCS(A' = (a_{i_1},p_1)(a_{i_2},p_2)...(a_{i_{m'}},p_{m'}),
    B' = (b_{j_1},r_1)(b_{j_2},r_2)...(b_{j_{n'}},r_{n'}))

    /* We use structure d^{k,ℓ} to denote a box as follows: */
    d^{k,ℓ}.corner   := d^{k,ℓ}_{p_k,r_ℓ}
    d^{k,ℓ}.jumptop  := "location of the next equal letter box above"
    d^{k,ℓ}.jumpleft := "location of the next equal letter box in the left"
    d^{k,ℓ}.sumtop   := If a_{i_k} ≠ b_{j_ℓ} Then ∑_{s=d^{k,ℓ}.jumptop+1}^{k} p_s
    d^{k,ℓ}.sumleft  := If a_{i_k} ≠ b_{j_ℓ} Then ∑_{s=d^{k,ℓ}.jumpleft+1}^{ℓ} r_s

    /* Initialize first row and column (let a_{i_0} = b_{j_0} = ε, p_0 = r_0 = 1) */
    d^{0,0}.corner ← 0
    For ℓ ∈ 1...n' Do d^{0,ℓ}.corner ← d^{0,ℓ-1}.corner + r_ℓ
    For k ∈ 1...m' Do d^{k,0}.corner ← d^{k-1,0}.corner + p_k
    Calculate values d^{k,ℓ}.(jumptop, jumpleft, sumtop, sumleft)

    /* Now we fill the rest of the corner values */
    For k ∈ 1...m' Do
      For ℓ ∈ 1...n' Do
        (bridge, k', ℓ', p, r, sum, d^{k,ℓ}.corner) ← (false, k, ℓ, p_k, r_ℓ, 0, ∞)
        If a_{i_k} ≠ b_{j_ℓ} Then /* Different letter box */
          d^{k,ℓ}.corner ← min(d^{k-1,ℓ}.corner + p_k, d^{k,ℓ-1}.corner + r_ℓ)
        Else While bridge = false Do
          /* Equal letter box, trace d^{k,ℓ}.corner */
          If p = r Then /* Straight from the diagonal */
            d^{k,ℓ}.corner ← min(d^{k,ℓ}.corner, sum + d^{k'-1,ℓ'-1}.corner)
            bridge ← true
          Else If p < r Then /* Diagonal up */
            (r, k') ← (r - p, k' - 1)
            d^{k,ℓ}.corner ← min(d^{k,ℓ}.corner, sum + d^{k',ℓ'-1}.corner + r)
            If d^{k',ℓ'}.corner = d^{k',ℓ'-1}.corner + r_{ℓ'} Then bridge ← true
            Else /* Jump to the next equal letter box */
              (sum, k') ← (sum + d^{k',ℓ'}.sumtop, d^{k',ℓ'}.jumptop)
              p ← p_{k'}
              If k' = 0 Then /* First row */
                d^{k,ℓ}.corner ← min(d^{k,ℓ}.corner, sum + d^{0,ℓ'-1}.corner + r)
                bridge ← true
          Else /* Diagonal left similarly */
    Return (m + n - d^{m',n'}.corner)/2 /* return the length of the LCS */

Fig. 4. The improved greedy algorithm to compute the LCS between A and B,
coded as a run-length sequence of pairs (letter, run_length).

Another way to consider the average length of a tracing path is to think that
every time a tracing path enters a different letter box, it has some
probability to hit a bridge. If the bridges were placed randomly in the
different letter boxes, then the probability to hit a bridge would be
$\frac{1}{2}$. This would immediately give a constant expected length for a
tracing path. However, the placing of the bridges depends on the computation of
the recurrence, and this makes the reasoning with probabilities much more
complex. We are still confident that the following conjecture holds, although
we are not (yet) able to prove it.

Conjecture 11 Let $A$ and $B$ be strings that are run-length encoded to lengths
$m'$ and $n'$, such that the runs are equally distributed with the same mean in
both strings. Under these assumptions the expected running time of the
algorithm in Fig. 4 for calculating $D_{ID}(A, B)$ is $O(m'n')$.

5.4 Experimental Results 

To test Conjecture 11 we ran the algorithm in Fig. 4 with the following
settings:

1. $m' = n' = 2000$, $|\Sigma| = 2$, runs in $[1, x]$,
   $x \in \{1, 10, 100, 1000, 10000, 100000, 1000000\}$.

2. $m' = 2000$, $n' \in \{1, 50, 100, 500, 1000, 1500, 2000\}$, $|\Sigma| = 2$,
   runs in $[1, 1000]$.

3. $m' = n' = 2000$, $|\Sigma| \in \{2, 4, 8, 16, 32, 64, 128, 256\}$, runs in
   $[1, 1000]$.

4. String $A$ was as in item 1 with runs in $[1, 1000]$. String $B$ was
   generated by applying $k$ random insertions/deletions on $A$, where
   $k \in \{0, 1, 10, 100, 1000, 10000, 100000\}$.

5. Real data: three different black/white images (printed lines from a book
   draft ($187 \times 591$), technical drawing ($160 \times 555$), and a
   signature ($141 \times 362$)). We ran the LCS algorithm on all pairs of
   lines in each image.

Table 1 shows the results. Different parameter choices are listed in the order
they appear in the above listing (e.g. setting 1 in test 1 corresponds to
$x = 1$, setting 2 corresponds to $x = 10$, etc.).

The average length $L$ of a tracing path (i.e. the amount of equal letter boxes
visited by a tracing path) was smaller than 2 in tests 1-4 (slightly greater in
test 5). That is, the running time was in practice $O(m'n')$ with a very small
constant factor. Test 1 showed that when the mean length of the runs increases,
then also $L$ increases, but not exceeding 2 ($L \in [1, 1.99]$). In test 2,
the worst situation was with $n' = m'$ ($L = 1.98$). We tested the effect of
the alphabet in test 3, and the worst was $|\Sigma| = 2$ ($L = 1.99$) and the
best was $|\Sigma| = 256$ ($L = 1.13$). Test 4 was used to simulate a typical
situation, in which the distance between the strings is small. The amount of
errors did not have much influence ($L \in [1.71, 1.72]$). In real data (test
5), there were also pairs that were close to the worst case (close to
$A = a^n$, $B = (ab)^{n/2}$), and therefore the results were slightly worse
than with randomly generated data: $L \in \{2.00, 2.34, 2.31\}$ with the three
images.

Table 1. The average length and the maximum length of a tracing path was
measured in different test settings. The values of tests 1-4 are averages over
10-10000 trials (e.g. on small values of $n'$ in test 2, more trials were
needed because of high variance, whereas otherwise the variance was small).
Test 5 was deterministic (i.e. the values are from one trial).

Average length of a tracing path (maximum length), listed as setting 1, setting 2, ...

test 1: 1 (1), 1.71 (18), 1.96 (28), 1.98 (27), 1.98 (32), 1.99 (29), 1.98 (25)
test 2: 1.73 (5), 1.77 (10), 1.74 (13), 1.80 (21), 1.90 (30), 1.97 (35), 1.98 (38)
test 3: 1.99 (30), 1.77 (20), 1.60 (14), 1.45 (14), 1.33 (9), 1.24 (7), 1.17 (6), 1.13 (6)
test 4: 1.71 (9), 1.71 (8), 1.71 (7), 1.71 (10), 1.72 (9), 1.72 (10), 1.72 (12)
test 5: 2.00 (35), 2.34 (146), 2.32 (31)



6 Conclusions 

We have presented new algorithms to compute approximate matches between
run-length compressed strings. The previous algorithms permit computing
their LCS. We have extended an existing LCS algorithm to the Levenshtein
distance without increasing the cost, and presented an algorithm with
nontrivial complexity for approximate searching of a run-length compressed
pattern in a run-length compressed text under either model.

Future work involves adapting our algorithm to more complex versions of the 
Levenshtein distance, including at least different costs for the edit operations. 
This would be interesting for applications related to image compression, where 
the change from a pixel value to the next is smooth. 

With respect to the original models, an interesting question is whether an 
algorithm can be obtained whose cost is just the product of the compressed 
lengths. Indeed, this seems possible in the average case, as demonstrated by the 
experiments with our improved algorithm for the LCS. 

Finally, a combination of two-dimensional approximate pattern matching
algorithms with two-dimensional run-length compression seems extremely
interesting.

Abstract. The upcoming so-called “on-chip Billion transistor” era raises
the question: What to do with all the on-chip hardware once the returns
on adding more on-chip memory start to diminish?

Parallel computing has been a strategic area of growth for computer 
science since the 1940s. So far, parallel computing affected main stream 
computer science only in a limited way. The key problem with parallel 
computers has been their programmability. 

The parallel algorithms research community has developed a theory of 
parallel algorithms for a very simple parallel computation model, the 
so-called PRAM (for parallel random-access machine, or model). That 
theory appears to be second in magnitude only to serial algorithmics. 
However, the evolution of parallel computers never reached a situation 
where the PRAM algorithmic computation model offered effective ab- 
straction for them. So, this elegant algorithmic theory remained in the 
ivory towers of theorists. Not only has it not been matched with a
real computer system, there has hardly been an experimental study of
what works better, more refined performance measurements, and a broad
study of applications. For example, the general question “how good
parallel algorithms can really be” has remained generally open.

Explicit Multi-Threading (XMT) is a new fine-grained computation 
framework which tries to address the hardware opportunity using the 
PRAM parallel algorithmic knowledge base. XMT aims at faster single-task
completion time by way of executing in parallel many instructions, all
within a single chip. Building on some key ideas of parallel computing,
XMT covers the spectrum from algorithms through architecture to im- 
plementation; the main implementation related innovation in XMT was 
through the incorporation of low-overhead hardware mechanisms (for 
more effective fine-grained parallelism). 

The two key research questions facing our “PRAM-on-chip vision” are:
(i) “how to build?” an XMT computer, and (ii) “who cares?”; that is,
what will be the key applications?

Abstract. We introduce a new notion of weak factor recognition that
is the foundation of new data structures and on-line string matching
algorithms. We define a new automaton built on a string
$p = p_1 p_2 \ldots p_m$ that acts like an oracle on the set of factors
$p_i \ldots p_j$. If a string is recognized by this automaton, it may be a
factor of $p$. But, if it is rejected, it is surely not a factor. We call it
factor oracle. More precisely, this automaton is acyclic, recognizes at least
the factors of $p$, has $m + 1$ states and a linear number of transitions. We
give a very simple sequential construction algorithm to build it. Using this
automaton, we design an efficient experimental on-line string matching
algorithm (we conjecture its optimality in regard to the experimental results)
that is really simple to implement. We also extend the factor oracle to predict
that a string could be a suffix (i.e. in the set of strings $p_i \ldots p_m$)
of $p$. We obtain the suffix oracle, which enables in some cases a tricky
improvement of the previous string matching algorithm.

Keywords: Finite automaton, string matching, algorithm design, 
information retrieval. 



1 Introduction 

A string $p$ is a sequence $p = p_1 p_2 \ldots p_m$ of letters taken from a
finite alphabet $\Sigma$. We keep the notation $p$ throughout this paper to
denote the string we are working on. A factor of $p$ is a string
$p_i \ldots p_j$, $1 \le i \le j \le m$.

The basic string matching problem is to find all occurrences of a pattern 
string p in a large text T. Efficient on-line string matching algorithms are based 
on indexes built on p. 

Many indexing techniques exist for this purpose. The simplest methods use
precomputed tables of $q$-grams while more advanced methods use more elaborate
data structures. These classical structures are: suffix arrays, suffix trees,
suffix automata or DAWGs, and factor automata (surveyed in the literature).

* Work partially supported by Wellcome Trust Foundation and by NATO Grant
PST.CLG.977017.

The notion on which these structures and these on-line string matching
algorithms are based is exact factor recognition. It means that we need to know
if a given string $u$ is or is not a factor of the pattern $p$. This notion
leads to very time-efficient string matching algorithms, but presents two major
drawbacks: (a) the structures require a fairly large amount of memory space
(which implies a large number of memory page breaks when searching for the
string in the text); (b) the algorithms are rather involved to implement.

It is considered, for example, that the implementation of suffix arrays can 
be achieved using five bytes per string character and that other structures need 
about twelve bytes per string character. 

We propose in this paper a new approach based on the notion of weak factor 
recognition. The idea is to recognize more than the exact factors of p to win in 
simplicity and memory requirements. 

For this purpose, we build a new structure, called factor oracle, that can
replace many of these indexes in on-line string matching algorithms. More
precisely, this structure is an automaton (a) that is acyclic, (b) that
recognizes at least the factors of $p$, (c) that has the fewest states possible
(i.e. $m + 1$), and (d) that has a linear number of transitions. The suffix and
factor automata satisfy (a)-(b)-(d) but not (c), whereas the sub-sequence
automaton satisfies (a)-(b)-(c) but not (d).

We give two different construction algorithms for the factor oracle, the first is 
only conceptual (and is used as definition), the second is a really simple practical 
sequential algorithm. 

The relations between our factor oracle and the suffix automaton allow us to 
define another new structure: the suffix oracle. 

From a theoretical point of view, these two structures are of interest. They 
represent the first attempt to formalize the notion of weak factor recognition. 
Although their constructions are very simple, their properties are rather difficult 
to establish and many points remain open and require further studies. 

We use these two new structures to design new experimental on-line string
matching algorithms. These algorithms have a very good average behavior that
we conjecture to be optimal. The main advantages of these new algorithms are
(1) that they are easy to implement for an optimal behavior and (2) that they
are in practice as fast as the fastest ones. A preliminary abstract version of
this paper has appeared elsewhere.

The factor oracle can be extended to a set of strings, and be used in
multi-string matching algorithms, which leads to very promising experimental
results.

We now define the notions and the notations we need throughout this paper. We
denote by $Fact(p)$ the set of all the factors of string $p$. A factor $x$ of
$p$ is a prefix (resp. a suffix) of $p$ if $p = xu$ (resp. $p = ux$) with
$u \in \Sigma^*$. The set of all the prefixes of $p$ is denoted by $Pref(p)$
and the set of all the suffixes by $Suff(p)$. We say that $x$ is a proper
factor (resp. proper prefix, proper suffix) of $p$ if $x$ is a factor (resp.
prefix, suffix) of $p$ distinct from $p$ and from the empty string $\epsilon$.
We denote by $pref_p(i)$ the prefix of length $i$ of $p$ for
$0 \le i \le |p|$. We denote for $u \in Fact(p)$,
$poccur(u, p) = \min\{|z| : z = wu \text{ and } p = wuv\}$, the ending
position of the first occurrence of $u$ in $p$. Finally, we define for
$u \in Fact(p)$ the set
$endpos_p(u) = \{ i \mid p = w u p_{i+1} \ldots p_m \}$.
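
The two position functions can be illustrated with a short Python sketch. The
function names mirror the notation above, but the code itself is ours and
simply scans the string; positions are 1-based as in the text.

```python
def poccur(u, p):
    """Ending position (1-based) of the first occurrence of u in p."""
    start = p.find(u)
    if start < 0:
        raise ValueError("u is not a factor of p")
    return start + len(u)


def endpos(u, p):
    """Set of all ending positions (1-based) of occurrences of u in p."""
    return {i for i in range(len(u), len(p) + 1) if p[i - len(u):i] == u}
```

For any factor $u$ of $p$, the smallest element of `endpos(u, p)` equals
`poccur(u, p)`.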

2 Factor Oracle 

2.1 Construction Algorithm 



Build_Oracle(p = p_1 p_2 ... p_m)

1. For i from 0 to m
2.     Create a new state i
3. For i from 0 to m - 1
4.     Build a new transition from i to i + 1 by p_{i+1}
5. For i from 0 to m - 1
6.     Let u be a minimal length word in state i
7.     For all σ ∈ Σ, σ ≠ p_{i+1}
8.         If uσ ∈ Fact(p_{i-|u|+1} ... p_m)
9.             Build a new transition from i to i - |u| + poccur(uσ, p_{i-|u|+1} ... p_m) by σ

Fig. 1. High-level construction algorithm of Oracle(p).



Definition 1. The factor oracle of a string $p = p_1 p_2 \ldots p_m$ is the
automaton built by the algorithm Build_Oracle (Figure 1) on the string $p$,
where all the states are terminal. It is denoted by $Oracle(p)$.

A string $w$ is recognized in state $i$ by the factor oracle if it labels a
path from state 0 to state $i$. The factor oracle of the string $p = abbbaab$
is given as an example in Figure 2. In this example, it can be noticed that the
string $aba$ is recognized whereas it is not a factor of $p$.
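
The high-level definition can be turned into a direct, unoptimized Python
sketch. This is only an illustration of the conceptual construction, not the
practical sequential algorithm mentioned in the introduction; the names
`build_oracle` and `recognizes` are ours. It assumes that the minimal length
word of state $i$ ends at position $i$ of $p$, so that it can be read off $p$
once its length is known, and that only letters occurring in $p$ need to be
tried.

```python
def build_oracle(p):
    """Return trans, where trans[i] maps a letter to the target state."""
    m = len(p)
    trans = [dict() for _ in range(m + 1)]
    for i in range(m):
        trans[i][p[i]] = i + 1                # transition i -> i+1 by p_{i+1}
    shortest = [0] + [m + 1] * m              # length of the minimal word per state
    for i in range(m):
        # All transitions into state i come from smaller states,
        # so shortest[i] is final here.
        start = i - shortest[i]
        u, suffix = p[start:i], p[start:]
        for sigma in sorted(set(p)):
            if sigma == p[i]:
                continue
            pos = suffix.find(u + sigma)      # first occurrence of u.sigma
            if pos >= 0:
                # i - |u| + poccur(u.sigma, suffix), with poccur = pos + |u| + 1
                trans[i][sigma] = i + pos + 1
        for target in trans[i].values():
            shortest[target] = min(shortest[target], shortest[i] + 1)
    return trans


def recognizes(trans, w):
    """True if w labels a path from state 0 (all states are terminal)."""
    state = 0
    for ch in w:
        if ch not in trans[state]:
            return False
        state = trans[state][ch]
    return True
```

With `trans = build_oracle("abbbaab")`, the call `recognizes(trans, "aba")`
follows the path 0, 1, 2, 5, which reproduces the example of a recognized
string that is not a factor.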




Fig. 2. Factor oracle of abbbaab. The word aba is recognized whereas it is not a 
factor. 



Note: all the transitions that reach state $i$ of $Oracle(p)$ are labeled by
$p_i$.

Lemma 1. Let $u \in \Sigma^*$ be a minimal length string among the strings
recognized in state $i$ of $Oracle(p)$. Then, $u \in Fact(p)$ and
$i = poccur(u, p)$.

Proof. By induction on the number of the state $i$. It is true for states 0 and
1. Assume that it is true for all states $0 \le j \le i - 1$. Let $u$ be a
minimal length string among the strings recognized in state $i$. We consider
the last transition on the path labeled by $u$ leading from 0 to $i$.

(i) This transition was built at line 4. Then $u = zp_i$ with
$z \in \Sigma^*$. As $u$ is a minimal length string of state $i$, $z$ is a
minimal length string for state $i - 1$. By the induction hypothesis,
$z \in Fact(p)$ and $i - 1 = poccur(z, p)$. And therefore,
$u = zp_i \in Fact(p)$ and $i = poccur(u, p)$.

(ii) This transition was built at line 9. Then it leads from $j$ to $i$ labeled
by $p_i$ with $0 \le j < i - 1$. So $u = zp_i$ and $z$ is a minimal length
string for state $j$. By the induction hypothesis, $z \in Fact(p)$ and
$j = poccur(z, p)$. And on account of the construction of the transition at
line 9, $u = zp_i \in Fact(p)$ and $i = poccur(u, p)$.

□ 

Corollary 1. Let $u \in \Sigma^*$ be a minimal length string among the strings
recognized in state $i$ of $Oracle(p)$. Then $u$ is unique.

We denote by $min(i)$ the minimal length string recognized in state $i$.

Corollary 2. Let $i$ and $j$ be two states of $Oracle(p)$ such that $j < i$.
Let $u = min(i)$ and $v = min(j)$. Then $u$ cannot be a suffix of $v$.

Proof. Assume that $u$ is a suffix of $v$. In this case,
$poccur(u, p) \le poccur(v, p)$, which contradicts (according to Lemma 1)
$j < i$. □

Lemma 2. Let $i$ be a state of $Oracle(p)$ and $u = min(i)$. Then $u$ is a
suffix of any string $c \in \Sigma^*$ recognized in state $i$.

Proof. By induction on the number of the state $i$. It is true for states 0 and
1. Assume that it is true for all states $0 \le j \le i - 1$. We consider state
$i$ and let $u = min(i)$.

Let $c$ be a path leading to state $i$, $c = c_1 a$ where $c_1$ leads to state
$j < i$. Let $v = min(j)$. Then

- $|va| \ge |u|$ because $u$ is the minimum of $i$ and $va$ leads to $i$.

- according to the construction (Figure 1, line 9), $i \in endpos_p(va)$
  because $v = min(j)$ and there is a transition from $j$ to $i$ by $a$.

- Lemma 1 gives $i \in endpos_p(u)$.

From these three facts, it follows that $u$ is a suffix of $va$. As $j < i$, by
the induction hypothesis, $v$ is a suffix of $c_1$ and therefore $u$ is a
suffix of $c$. □

Lemma 3. Let $w \in Fact(p)$. Then $w$ is recognized by $Oracle(p)$ in a state
$j \le poccur(w, p)$.

Proof. By induction on the length of $w = w_0 w_1 \ldots w_f$.

We denote $i = poccur(w, p)$.

- There is a transition from 0 by $w_0$ leading to a state $k_0 \le i - f$.

- Assume that there is a path labeled by $w_0 \ldots w_j$ leading to a state
  $k_j \le i - f + j$. Let $u = min(k_j)$. According to Lemma 2,
  $u = w_{j-|u|+1} \ldots w_j$. As $u w_{j+1}$ is a factor of $p$ and as
  $i - f + (j + 1) \in endpos_p(u w_{j+1})$, there is a transition from $k_j$
  labeled $w_{j+1}$ leading to a state $k_{j+1} \le i - f + (j + 1)$.

□ 

Corollary 3. Let w ∈ Fact(p). Every string v ∈ Suff(w) is recognized by
Oracle(p) in a state j ≤ poccur(w,p).

Lemma 4. Let i be a state of Oracle(p) and u = min(i). Any path ending by u
leads to a state j ≥ i.

Proof. By induction on the number of the state i. It is true for states 0 and 1.
Assume that it is true for all states 0 ≤ j ≤ i - 1. We consider state i. Let u be
a minimal length string among the strings recognized in state i.

Consider the minimal length path labeled by u leading to i. We denote j
the state preceding i along this path, and v = min(j). We have u = va by
construction. Consider a path c = c_1 u leading to a state k. Assume that k < i.

By the induction hypothesis, the path c_1 v leads to a state l ≥ j. If l = j,
then k = i, which is a contradiction with k < i. We have now j < l < k < i. Let
w = min(l).

- If |w| ≥ |v|, then v is a suffix of w (Lemma 2). In that case, u = va is a
suffix of wa, so that k ∈ endpos_p(u), which is a contradiction with k < i =
poccur(u,p) (Lemma 1).

- If |w| < |v|, then w is a suffix of v (Lemma 2). This is a contradiction with
j < l (Corollary 2).

In both cases, we reach a contradiction, so that k ≥ i and the induction
hypothesis is verified for i. □

Lemma 5. Let w ∈ Σ* be a string recognized by Oracle(p) in i. Then any suffix
of w is recognized in a state j ≤ i.

Proof. By induction on |w|. It is true if |w| = 0 or |w| = 1. Assume that it is
true for all the strings x such that |x| < |w|. We show that it is also true for w,
recognized in i.

Let w = xa, where x is recognized in k < i. Consider a proper suffix of w. It can
be written va where v is a proper suffix of x.

According to the induction hypothesis, v is recognized in l ≤ k. Let x' =
min(k) and v' = min(l). Lemma 2 implies that x' is a suffix of x and that v' is
a suffix of v. Corollary 2 sets that v' is a suffix of x'. As i ∈ endpos_p(x'a)
(by construction of the transition by a), i ∈ endpos_p(v'a). So there is a
transition from l by a leading to a state j ≤ i. Thus, the proper suffix va is
recognized in state j. □

The number of states of Oracle(p) with p = p_1 p_2 ... p_m is m + 1. We now
consider the number of transitions.

Lemma 6. The number T_Or(p) of transitions in Oracle(p = p_1 p_2 ... p_m)
satisfies m ≤ T_Or(p) ≤ 2m - 1.

Proof. There are always m transitions i → i + 1 labeled by p_{i+1}. As the strings
a^m only have these transitions, m is the minimal number of transitions.

Consider now the transitions i → j where j > i + 1. We build an injective
function that maps each of these transitions to a proper suffix of p. Each
transition i → j with j > i + 1 labeled by σ is mapped to the string
min(i) σ p_{j+1} ... p_m. By construction of Oracle(p), this string is a proper
suffix of p. We show a contrario that this function is injective. Assume that
there are two distinct transitions i_1 → j_1 and i_2 → j_2, respectively labeled
by σ_1 and σ_2, such that
min(i_1) σ_1 p_{j_1+1} ... p_m = min(i_2) σ_2 p_{j_2+1} ... p_m.

Assume that i_1 ≥ i_2. We consider three cases. If j_1 = j_2, then we have
σ_1 = σ_2 and the equality implies min(i_1) = min(i_2) and i_1 = i_2. If j_1 > j_2,
min(i_2) σ_2 is a proper prefix of min(i_1). Let δ = |min(i_1)| - |min(i_2) σ_2|, then
we have j_2 = j_1 - δ - 1. As an occurrence of min(i_1) ends in i_1 (according to
Lemma 1), an occurrence of min(i_2) σ_2 ends in i_1 - δ < j_1 - δ - 1 = j_2. This is
a contradiction with the construction of Oracle(p). If j_1 < j_2, min(i_1) is a
prefix of min(i_2), so there is an occurrence of min(i_1) ending before i_2 ≤ i_1.
This is a contradiction with Lemma 1.

The three other cases, for i_1 < i_2, are resolved in a symmetric way. The
function is indeed injective, and as the set of proper suffixes of p = p_1 p_2 ... p_m
is of size m - 1, T_Or(p) ≤ m + m - 1 = 2m - 1. This maximum is reached for
the strings a^{m-1} b. □

The factor automaton can be coded in a memory efficient simple way. We do 
not have to code the states since they are positions in the string. We just have 
to code the external transitions. Their labels do not have to be coded since they 
are fixed by their arrival states. 
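
The memory-efficient coding described above can be sketched as follows. This is a minimal illustration in Python, not the authors' implementation: the names P, EXTERNAL, step and recognizes are hypothetical, and the table of external transitions is an assumption obtained by running the add_letter procedure by hand on p = abbbaab. States are positions in p, internal transitions i → i + 1 are implicit, and the label of an external transition is read from p at its arrival state.

```python
# Pattern; state i corresponds to the prefix P[:i] (Python strings are 0-indexed).
P = "abbbaab"

# External transitions only (targets j > i + 1), stored without labels.
# Assumption: this table was derived by hand for P with the add_letter procedure.
EXTERNAL = {0: [2], 1: [6], 2: [5], 3: [5]}


def step(p, external, state, letter):
    """Follow the transition from `state` by `letter`; return None if there is none."""
    # Internal transition state -> state + 1, labeled by p_{state+1}.
    if state < len(p) and p[state] == letter:
        return state + 1
    # External transitions: the label is the letter of p at the arrival state.
    for target in external.get(state, []):
        if p[target - 1] == letter:
            return target
    return None


def recognizes(p, external, word):
    """Return True if `word` labels a path from state 0 in the coded oracle."""
    state = 0
    for letter in word:
        state = step(p, external, state, letter)
        if state is None:
            return False
    return True


# Total number of transitions: m internal ones plus the stored external ones.
TOTAL_TRANSITIONS = len(P) + sum(len(targets) for targets in EXTERNAL.values())
```

With this coding, only the four external targets are stored for abbbaab. The string abba is recognized although it is not a factor of abbbaab, which illustrates that the oracle accepts a superset of the factors.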



2.2 Sequential Algorithm 

This section presents a sequential construction of the automaton Oracle (p), that 
means a way of building the automaton by reading the letters of p one by one 
from left to right, upgrading the automaton at each step. 

We denote repet_p(i) the longest suffix of pref_p(i) that appears at least twice
in pref_p(i). We define a function S_p on the states of the automaton, called
supply function, that maps each state i > 0 of Oracle(p) to the state j in which
the reading of repet_p(i) ends. We arbitrarily set S_p(0) = -1. Notice that S_p(i)
is well defined for every state i of Oracle(p) (Corollary 3), and that for any
state i of Oracle(p), i > S_p(i) (Lemma 3). We denote k_0 = m, k_i = S_p(k_{i-1})
for i ≥ 1. The sequence of the k_i is finite, strictly decreasing and ends in
state 0. We denote CS_p = {k_0 = m, k_1, ..., k_t = 0} the suffix path of p in
Oracle(p).
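
To make the definition of repet_p(i) concrete, here is a brute-force sketch in Python. The function name repet and the quadratic search are illustrative assumptions; the sequential algorithm never computes repet_p(i) explicitly. Occurrences are allowed to overlap, and pref_p(i) is taken to be the first i letters of p.

```python
def repet(p, i):
    """Longest suffix of p[:i] that occurs at least twice in p[:i] (brute force)."""
    pref = p[:i]
    # Try candidate suffixes from the longest proper one down to the empty string.
    for length in range(i - 1, -1, -1):
        suffix = pref[i - length:]
        # The suffix itself starts at i - length; an earlier start is a second occurrence.
        if pref.find(suffix) < i - length:
            return suffix
    return ""
```

For p = abbbaab, this gives repet_p(4) = bb and repet_p(7) = ab. Reading these strings in Oracle(abbbaab) ends in states 3 and 2, which are the supply values of states 4 and 7.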

Lemma 7. Let k > 0 be a state of Oracle(p) such that s = S_p(k) is strictly
positive. We denote w_k = repet_p(k) and w_s = repet_p(s). Then w_s is a suffix of
w_k.

Proof. Let j be the state of Oracle(p) reached after the reading of w_s. Let v =
min(s).

- j < s. Lemma 3 implies that j ≤ poccur(w_s), which is by definition strictly
less than s.

- |w_s| < |v|. Assume the contrary. Then there is a path ending by v = min(s)
leading to j < s. This is a contradiction with Lemma 4.

Lemma 1 implies s = poccur(v,p). As w_s is a suffix of pref_p(s) of length less
than |v|, w_s is a proper suffix of v, and, as v is itself a suffix of w_k (Lemma 2),
w_s also is. □

Corollary 4. Let CS_p = {k_0, k_1, ..., k_t = 0} be the suffix path of p in Oracle(p)
and let w_i = repet_p(k_{i-1}) for 1 ≤ i ≤ t and w_0 = p. Then, for 0 < l ≤ t, w_l is a
suffix of all the w_i, 0 ≤ i < l ≤ t.

We now consider, for a string p = p_1 p_2 ... p_m and a letter σ ∈ Σ, the
construction of Oracle(pσ) from Oracle(p). We denote Oracle(p) + σ the automaton
Oracle(p) on which a transition by σ from state m to state m + 1 is added.
We already notice that a transition that exists in Oracle(p) + σ also exists in
Oracle(pσ), so that the difference between the two automata only relies on
transitions by σ to state m + 1 that have to be added to Oracle(p) + σ in order to
get Oracle(pσ). We are investigating states from which there may exist transitions
by σ to state m + 1.

Lemma 8. Let k be a state of Oracle(p) + σ such that there is a transition from
k by σ to m + 1 in Oracle(pσ). Then k has to be one of the states of the suffix
path CS_p in Oracle(p) + σ.

Proof. On the contrary, assume that there exists a state k_{l+1} < k < k_l such
that there is a transition to m + 1 by σ in Oracle(pσ). We denote
w_j = repet_p(k_{j-1}) for 1 ≤ j ≤ t, and w_0 = p. We have m ∈ endpos_p(w_j). Let
v = min(k). As there is a transition by σ from k to m + 1, m ∈ endpos_p(v), and v
is comparable to the factors w_j (Corollary 4). The factor v must satisfy:

(i) |v| < |w_l|. Assume, on the contrary, that |v| ≥ |w_l|. Then |v| > |w_l|, or
else there will be two paths labeled by v leading to two different states.
Consider the greatest 0 ≤ d < l such that |w_{d+1}| < |v| ≤ |w_d|. The factor v is
a suffix of w_d. As v = min(k), according to Lemma 1 it occurred in k < k_d, and
as it is also a suffix of w_d, it occurred at least twice in pref_p(k_d). In that
case, by definition of the k_j, k_{d+1} cannot be S_p(k_d), and there is a
contradiction. So |v| < |w_l| and v is a proper suffix of w_l.

(ii) |v| > |w_{l+1}|. Assume on the contrary that |v| ≤ |w_{l+1}|. Then v is a
suffix of w_{l+1} and the path labeled w_{l+1} leading to k_{l+1} < k ends by v =
min(k). This is a contradiction with Lemma 4.

(iii) |v| < |min(k_l)|. Assume on the contrary that |v| ≥ |min(k_l)|. The factor
min(k_l) is a suffix of w_l (Lemma 2) of which v is also a suffix (by (i)). So
min(k_l) is a suffix of v = min(k) and we get a contradiction with Corollary 2.

As v is a suffix of min(k_l) (by (iii)), k_l ∈ endpos_p(v) (Lemma 1). But k_l > k ∈
endpos_p(v) and v occurs at least twice in pref_p(k_l). As (by (ii)) |v| > |w_{l+1}|,
there is a contradiction with w_{l+1} = repet_p(k_l). □

Among the states on the suffix path of p, every state that has no transition
by σ in Oracle(p) + σ must have one in Oracle(pσ). More formally, the following
lemma states this fact.

Lemma 9. Let k_l < m be a state on the suffix path CS_p of state m in
Oracle(p = p_1 p_2 ... p_m) + σ. If k_l does not have a transition by σ in Oracle(p),
then there is a transition by σ from k_l to m + 1 in Oracle(pσ).

Proof. Let v = min(k_l), then v is a suffix of w_l = repet_p(k_{l-1}) (Lemma 2). As
w_l is a suffix of p, w_l σ is a suffix of pσ and m + 1 = poccur(w_l σ). According to
the construction of Oracle(pσ), there is a transition by σ from k_l to m + 1. □

Lemma 10. Let k_l < m be a state on the suffix path CS_p = {k_0 = m, k_1, ..., k_t =
0} of m in Oracle(p = p_1 p_2 ... p_m) + σ. If k_l has a transition by σ in
Oracle(p) + σ, then all the states k_i, l ≤ i ≤ t, also have a transition by σ in
Oracle(p) + σ.

Proof. Let w_l = repet_p(k_{l-1}). All the w_i = repet_p(k_{i-1}), l ≤ i ≤ t, are
suffixes of w_l. As w_l σ is recognized by Oracle(p) + σ, by Lemma 5 all its
suffixes also are. □

The idea of the sequential construction algorithm is the following. According
to the three Lemmas 8, 9 and 10, to transform Oracle(p) + σ into Oracle(pσ) we
only have to go down the suffix path CS_p = {k_0 = m, k_1, ..., k_t = 0} of state m
and, while the current state k_l does not have an exiting transition by σ, a
transition by σ to m + 1 should be added (Lemma 9). If k_l already has one, the
process ends because, according to Lemma 10, all the states k_j after k_l on the
suffix path already have a transition by σ.

To add a single letter, the preceding algorithm is enough. But, as we build
the automaton by adding the letters of p one after the other, we must update
the supply function S_{pσ} of the new automaton Oracle(pσ). As (according to the
definition of S_p) the supply function of states 0 ≤ i ≤ m does not change from
Oracle(p) to Oracle(pσ), the only thing to do is to compute S_{pσ}(m + 1). This is
done with the following lemma.

Lemma 11. If there is a state k_d which is the greatest element of CS_p = {k_0 =
m, k_1, ..., k_t = 0} in Oracle(p) such that there is a transition by σ from k_d to a
state s in Oracle(p), then S_{pσ}(m + 1) = s in Oracle(pσ). Else S_{pσ}(m + 1) = 0.

Proof. Let w = repet_{pσ}(m + 1). First assume that there is no such state. As
0 ∈ CS_p in Oracle(p), there is no transition by σ leaving 0 in Oracle(p), so σ
does not occur in p and w = ε and S_{pσ}(m + 1) = 0.

We now assume that there is such a state k_d. Then w is not the empty string
and so we can write w = ασ. Furthermore k_d < m because m is the last state of
Oracle(p). Let w_j = repet_p(k_{j-1}) for 0 < j ≤ t and w_0 = p. We first prove the
two following points:

(1) |α| < |w_{d-1}|. Conversely assume that |α| ≥ |w_{d-1}|, then w_{d-1} is a suffix
of α. As ασ is a factor of p (it occurs twice in pσ), ασ is recognized by Oracle(p)
(Lemma 3) and w_{d-1}σ also is. This contradicts the fact that k_d is the greatest
state of CS_p that has a transition by σ in Oracle(p).

(2) Let i be the state recognizing α in Oracle(p); i is strictly inferior to k_{d-1}
according to Lemma 5 and to the fact that i has a transition by σ whereas
k_{d-1} has not.

We now compare α and w_d. One is a suffix of the other. We get the two
following cases:

(1) Assume that |α| ≥ |w_d|. Then i ≥ k_d because w_d is a suffix of α. We
conversely prove that k_d = i. Assume that k_d < i. Then |w_d| < |min(i)| because
otherwise the path labeled by w_d would end by min(i) and would lead to a state
strictly before i, which would contradict Lemma 4. But min(i) also occurs in
k_{d-1} because min(i) and min(k_{d-1}) are comparable (by suffix relation) and, on
account of Corollary 2, min(k_{d-1}) cannot be a suffix of min(i), so min(i) is
a suffix of min(k_{d-1}). So min(i) is a suffix of w_{d-1} which occurs twice
in pref_p(k_{d-1}) and which is strictly longer than w_d = repet_p(k_{d-1}). This
contradicts the definition of w_d. So k_d = i, with the result that ασ leads to
the same state as w_d σ: s.

(2) Now assume that |α| < |w_d|, then i ≤ k_d because α is a suffix of w_d.
|α| ≥ |min(k_d)| because, as there exists a transition from k_d by σ to s,
min(k_d)σ is both a factor of p and a suffix of pσ, and ασ is the longest of these
factors. So min(k_d) is a suffix of α and by Lemma 4, i ≥ k_d. It follows that
i = k_d and ασ leads to the same state as w_d σ: s.

Then the path ασ also leads to s and therefore s = S_{pσ}(m + 1).

□ 

From these lemmas we can now deduce an algorithm add_letter to transform
Oracle(p) into Oracle(pσ). It is given in Figure 3.

Lemma 12. The algorithm add_letter really builds Oracle(pσ) from Oracle(p =
p_1 p_2 ... p_m) and updates the supply function of the new state m + 1 of Oracle(pσ).

Proof. We go down the suffix path of p in accordance with Lemma 9. We stop
in accordance with Lemma 10 and we update the supply value of state m + 1
according to Lemma 11. □

The complete algorithm Oracle-sequential that builds Oracle(p = p_1 p_2 ... p_m)
just consists in upgrading the automaton by adding the letters p_i one by
one from left to right with the function add_letter.

Theorem 1. The algorithm Oracle-sequential(p = p_1 p_2 ... p_m) builds Oracle(p).

Proof. By induction on string p using Lemma 12. □

Function add_letter(Oracle(p = p_1 p_2 ... p_m), σ)
 1. Create a new state m + 1
 2. Create a new transition from m to m + 1 labeled by σ
 3. k ← S_p(m)
 4. While k > -1 and there is no transition from k by σ Do
 5.     Create a new transition from k to m + 1 by σ
 6.     k ← S_p(k)
 7. End While
 8. If (k = -1) Then s ← 0
 9. Else s ← where leads the transition from k by σ.
10. S_{pσ}(m + 1) ← s
11. Return Oracle(p = p_1 p_2 ... p_m σ)

Fig. 3. Add a letter σ to Oracle(p = p_1 p_2 ... p_m) to get Oracle(pσ).
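
The pseudo-code of Figure 3 and the algorithm Oracle-sequential can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the representation (one dictionary of outgoing transitions per state, plus a list of supply values) and the names add_letter and oracle_sequential as Python functions are illustrative choices, not part of the original text.

```python
def add_letter(trans, supply, sigma):
    """Extend Oracle(p) to Oracle(p + sigma) in place (Figure 3, lines 1-10).

    trans[i] maps a letter to the target state of the transition from i;
    supply[i] is the supply value S_p(i), with supply[0] == -1.
    """
    m = len(trans) - 1
    trans.append({})              # line 1: new state m + 1
    supply.append(None)
    trans[m][sigma] = m + 1       # line 2: internal transition m -> m + 1
    k = supply[m]                 # line 3
    while k > -1 and sigma not in trans[k]:   # line 4
        trans[k][sigma] = m + 1               # line 5: external transition
        k = supply[k]                         # line 6
    # lines 8-10: supply value of the new state
    supply[m + 1] = 0 if k == -1 else trans[k][sigma]


def oracle_sequential(p):
    """Build the oracle of p letter by letter, from left to right."""
    trans = [{}]      # state 0 only
    supply = [-1]     # S_p(0) = -1 by convention
    for sigma in p:
        add_letter(trans, supply, sigma)
    return trans, supply
```

For instance, oracle_sequential("abbbaab") yields the supply values [-1, 0, 0, 2, 3, 1, 1, 2] and 11 transitions in total, 7 internal and 4 external.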

Theorem 2. The complexity of the algorithm Oracle-sequential(p = p_1 p_2 ... p_m)
is O(m) in time and in space.

Proof. The algorithm is in O(m) in space. Indeed, all the transitions which are
created by the algorithm are transitions of Oracle(p). Exactly m + 1 states are
created and a supply value associated to each of these states can be stored in
constant space. So the algorithm requires linear space.

The algorithm is in O(m) in time. As we only create the states and the
transitions that are necessary, the only point to verify is that the total number
of backward jumps on the supply path (lines 4-6, Figure 3) is linear.

In each stage i of the construction, i.e. when letter p_i is being added, the
number r_i of backward jumps on the supply path is bounded by k_i =
|repet_p(i - 1)|. During the transition from stage i to stage i + 1, we have
k_{i+1} ≤ k_i - r_i + 2 and r_i ≤ k_i - k_{i+1} + 2. The sum of the r_i is therefore
bounded by 2n and the algorithm is linear in time. □

Example. The sequential construction of Oracle(abbbaab) is given in Figure 4.



3 Suffix Oracle

The links between the suffix automaton and our factor oracle lead to a
straightforward extension of the oracle: it is possible to mark some states as
terminal on the factor oracle, as on the suffix automaton, in order to recognize
suffixes of p. This extension will allow us to use some properties of the suffix
automaton. We call this new structure suffix oracle and we denote it by
SOracle(p).

Footnote (to Theorem 2): The constants involved in the asymptotic bound of the
complexity of the sequential construction algorithm depend on the implementation
and may involve the size of the alphabet Σ. If we implement the transitions in a
way that they are accessible in O(1) (use of tables), then the complexity is O(m)
in time and O(|Σ| · m) in space. If we implement the transitions in a way that
they are accessible in O(log |Σ|) (use of search trees), then the complexity is
O(log |Σ| · m) in time and O(m) in space.

[Figure 4 panels: (a); (b) Add a; (c) Add b; (d) Add b; (e) Add b; (f) Add a;
(g) Add a; (h) Add b.]

Fig. 4. Sequential construction of Oracle(abbbaab). The dot-lined arrows
represent the supply function.

Definition 2. A state q of the suffix oracle is terminal if and only if there is a 
path labeled by a suffix of p leading from the initial state to q. 

The high-level construction algorithm of the factor oracle (see Figure 1) cannot
be easily modified in order to build the suffix oracle because it cannot
detect terminal states. Conversely, the sequential construction algorithm can,
because of the supply function. This is the point of the following lemma. Let
us recall that for Oracle(p = p_1 p_2 ... p_m), the sequence defined by k_0 = m,
k_i = S_p(k_{i-1}) for i ≥ 1 is finite, strictly decreasing and ends in state 0. We
denote CS_p = {k_0 = m, k_1, ..., k_t = 0} the suffix path of p in Oracle(p).

Lemma 13. The terminal states of SOracle(p) are the states of Oracle(p) that
are on the suffix path CS_p.

Proof. 

(i) If k ∈ CS_p, then k is a terminal state of SOracle(p). We denote w_0 = p and
w_i = repet_p(k_{i-1}) for all 1 ≤ i ≤ t. Corollary 4 sets that w_l is a suffix of
all the w_i with 0 ≤ i < l ≤ t, so w_l is a suffix of p. As w_i leads to k_i with
0 ≤ i ≤ t, the states k_i are reachable by a suffix and so are terminal.

(ii) If q is a state of Oracle(p) such that there is a path from the initial state
0 to q labeled by a suffix s of p, then q ∈ CS_p. If q = m, q = k_0, then the
property holds. Assume now that 0 < q < m. As s is a suffix of p, there is
an i such that s is a proper suffix of w_{i-1} (which leads to k_{i-1}) and such
that w_i (which leads to k_i) is a proper suffix of s. According to Lemma 5 we
have k_i ≤ q ≤ k_{i-1}.

Assume that k_i < q < k_{i-1}. Consider v = min(q).

We first deal with the case where w_i = ε (k_i = 0). The path labeled by s
which leads to q ≠ 0 ends by v = min(q), which is not the empty string. As
s is a proper suffix of w_{i-1}, v is also a suffix of w_{i-1} and k_{i-1} ∈
endpos_p(v). So there are occurrences of v both in q < k_{i-1} and in k_{i-1},
which contradicts ε = w_i = repet_p(k_{i-1}).

We can now assume that |w_i| > 0.

We first notice that |v| < |w_i|. Indeed, conversely assume that |v| ≥ |w_i|. We
cannot have |v| = |w_i| because in that case v = w_i (they are both suffixes
of s) and q = poccur(v,p). As we assume that k_i < q, this is a contradiction
with Lemma 4. So |v| > |w_i|, but in that case as (1) q ∈ endpos_p(v) (Lemma 1),
(2) k_{i-1} ∈ endpos_p(v) (by Lemma 2, v is a suffix of w_{i-1}), (3) q < k_{i-1} (by
assumption), it follows that v is a suffix of w_{i-1}, occurs strictly before k_{i-1}
and is longer than w_i. This contradicts the definition of w_i = repet_p(k_{i-1}).
As |v| < |w_i| and as w_i and v (Lemma 2) are suffixes of s, v is a proper suffix
of w_i. As v = min(q), Lemma 4 contradicts the fact that k_i < q.



□ 

To transform the factor oracle of p into a suffix oracle, we just go down the 
suffix path of the last state created by the sequential construction of Oracle(p) 
marking each encountered state as terminal. The pseudo-code of the construction 
of the suffix oracle (using the sequential construction of the factor oracle) is given 
in Figure 5 



Suffix-oracle(p = p_1 p_2 ... p_m)
 1. Oracle-sequential(p)
 2. t ← m
 3. While S_p(t) ≠ -1 Do
 4.     mark t as terminal
 5.     t ← S_p(t)
 6. End While

Fig. 5. Construction algorithm of the suffix oracle SOracle(p = p_1 p_2 ... p_m).
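
The marking of terminal states can be sketched in Python as follows. The function names and the dictionary-based representation are assumptions made for illustration. One deliberate difference from the pseudo-code above: the sketch follows Lemma 13 and marks every state of the suffix path, including state 0 (reached by the empty suffix), whereas the loop of Figure 5 stops before marking state 0.

```python
def build_oracle_with_supply(p):
    """Sequential construction; returns (transitions, supply values)."""
    trans, supply = [{}], [-1]
    for m, sigma in enumerate(p):
        trans.append({})
        trans[m][sigma] = m + 1
        k = supply[m]
        while k > -1 and sigma not in trans[k]:
            trans[k][sigma] = m + 1
            k = supply[k]
        supply.append(0 if k == -1 else trans[k][sigma])
    return trans, supply


def terminal_states(supply):
    """States on the suffix path of the last state (Lemma 13)."""
    terminals = set()
    k = len(supply) - 1          # last state m
    while k != -1:
        terminals.add(k)
        k = supply[k]
    return terminals


def accepted_as_suffix(trans, terminals, word):
    """True if `word` labels a path from state 0 to a terminal state."""
    state = 0
    for letter in word:
        state = trans[state].get(letter)
        if state is None:
            return False
    return state in terminals
```

For p = abbbaab the terminal states are 0, 2 and 7. Every suffix of p is accepted, but so is abbaab, which is not a suffix of p: like the factor oracle, the suffix oracle may accept more than the exact language.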



For instance, the construction of SOracle(abbbaab) is given in Figure 6.

Fig. 6. Example of suffix oracle. Double-circled states are terminal. The supply
function is represented by dot-lined arrows.



We mainly use the factor oracle rather than the suffix oracle for the following
reason. The strength of the factor oracle lies in its simplicity of construction
(which becomes a little more complicated for the suffix oracle) but above all in
the small amount of memory needed to implement it. This memory saving is based
on the fact that the states of the automaton do not have to be coded, because we
can consider a position in the string p as a state. So we only have to code the
external transitions, of which there are at most m - 1. It is more difficult
to do the same with the suffix oracle because we need a way of marking the
terminal states. This complicates the implementation and slows the terminality
test if one wants to keep a compact implementation.

Note. For some strings, the suffix oracle matches the suffix automaton and
therefore recognizes exactly the suffixes. On a binary alphabet Σ = {0, 1}, it is
notably the case for Fibonacci words and more generally for any left special
factor of an infinite Sturmian word (a factor u of a Sturmian word is left special
if and only if 0u and 1u are both factors of the same Sturmian word). The
interested reader can refer to the cited reference for more details on this point.



4 String Matching 

The factor oracle of p can be used in the same way as the suffix automaton in
string matching in order to find the occurrences of a word p = p_1 p_2 ... p_m in
a text T = t_1 t_2 ... t_n, both on an alphabet Σ. The suffix automaton is used
to get an algorithm called BDM (for Backward Dawg Matching). Its average
complexity is in O(n log_|Σ|(m)/m) under a Bernoulli model of probability where
all the letters have the same probability. Yao proved that this bound is
optimal. The BDM algorithm moves a window of size m on the text. For each
new position of this window, the suffix automaton of p^r (the mirror image of p)
is used to search for a factor of p from the right to the left of the window. The
basic idea of BDM is that if this backward search fails on a letter σ after
the reading of a word u, then σu is not a factor of p and moving the beginning
of the window just after σ is safe. This idea is then refined in BDM using
some properties of the suffix automaton.

However, this idea alone is enough to get an efficient string matching
algorithm. Most strikingly, the strict recognition of the factors (that the
factor and suffix automata allow) is not necessary. For the algorithm to work, it
is enough to know that σu is not a factor of p. The oracle can be used to replace
the suffix automaton, as illustrated by Figure 7. We call this new algorithm
BOM, for Backward Oracle Matching. Its proof is given in Lemma 14. We make
the conjecture (according to the experimental results) that BOM is still optimal
on average.

Fig. 7. Shift of the search window after the fail of the search by Oracle(p). The
word σu is not a factor of p.



BOM(p = p_1 p_2 ... p_m, T = t_1 t_2 ... t_n)
 1. Pre-processing
 2.     Construction of the oracle of p^r
 3. Search
 4.     pos ← 0
 5.     While (pos ≤ n - m) do
 6.         state ← initial state of Oracle(p^r)
 7.         j ← m
 8.         While state exists do
 9.             state ← image state by T[pos + j] in Oracle(p^r)
10.             j ← j - 1
11.         EndWhile
12.         If j = 0 do
13.             mark an occurrence at pos + 1
14.             j ← 1
15.         EndIf
16.         pos ← pos + j
17.     EndWhile

Fig. 8. Pseudo-code of BOM algorithm.

Lemma 14. The BOM algorithm marks all the occurrences of p in T and only
them.

Proof.

- BOM only marks valid occurrences because the only word of length m
recognized by Oracle(p^r) is p^r itself.

- BOM marks all the occurrences of p. Indeed, assume on the contrary that
there is an occurrence of p that no window matches. As the window shift
is at most m, we necessarily have the situation described in Figure 9 (u
may be the empty word), where Window 1 and Window 2 are consecutive in
the algorithm. The recognition failure should have occurred in σ; this is not
possible because σu is a factor of p.



□ 
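
The BOM search can be sketched in Python as follows. This is a hedged illustration rather than the authors' code: positions are 0-indexed, the function names build_oracle and bom_search are assumptions, and the inner loop is written so that it stops explicitly either on a missing transition or once the whole window has been read, which is one way to realize the intent of the pseudo-code of Figure 8.

```python
def build_oracle(p):
    """Sequential construction of the factor oracle; returns the transitions."""
    trans, supply = [{}], [-1]
    for m, sigma in enumerate(p):
        trans.append({})
        trans[m][sigma] = m + 1
        k = supply[m]
        while k > -1 and sigma not in trans[k]:
            trans[k][sigma] = m + 1
            k = supply[k]
        supply.append(0 if k == -1 else trans[k][sigma])
    return trans


def bom_search(p, t):
    """Return the 0-indexed start positions of the occurrences of p in t."""
    m, n = len(p), len(t)
    if m == 0 or m > n:
        return []
    trans = build_oracle(p[::-1])      # oracle of the mirror image of p
    occurrences = []
    pos = 0
    while pos <= n - m:
        state, j = 0, m
        # Read the window from right to left while the oracle accepts.
        while j > 0:
            state = trans[state].get(t[pos + j - 1])
            if state is None:
                break                  # failure on the letter t[pos + j - 1]
            j -= 1
        if state is not None:
            # m letters were read: the only word of length m recognized is p^r.
            occurrences.append(pos)
            pos += 1
        else:
            # The new window starts just after the failing letter.
            pos += j
    return occurrences
```

A quick way to gain confidence in such a sketch is to compare its output with a naive scan of the text on a few patterns.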

The worst-case complexity of BOM is O(nm). However, on average, we
make the following conjecture based on experimental results (see Section 4.3):

Conjecture 1. Under a model of independence and equiprobability of letters,
the BOM algorithm has an optimal average complexity of O(n log_|Σ|(m)/m).



[Figure 9: two consecutive search windows, Window 1 and Window 2, over the text,
with the letter σ and the word u marked.]

Fig. 9. Impossible situation in the BOM algorithm during the search phase.



4.1 Approach Using the Suffix Oracle

The use of the suffix oracle instead of the factor oracle allows a refinement of
the preceding approach. This refinement comes directly from the use of the suffix
automaton in BDM. During the backward search phase in the suffix oracle, if a
terminal state (which does not correspond to the whole word) is encountered, the
position in the window is saved in a variable last. This enables us to give a
bound on the longest read factor which is a suffix of p^r, i.e. a prefix of p. By
saving the last state, we save a bound on the longest prefix and we can shift the
window up to last (whether or not the string is found in the current window).
Figure 10 illustrates this improvement.

We call this algorithm BSOM for Backward Suffix Oracle Matching. Its
worst-case complexity is still O(nm).

Fig. 10. Search with the suffix oracle. The terminal state marking allows an
improvement of the search based on the factor oracle.
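
The refinement with the variable last can be sketched in Python as follows. As before, this is an illustrative sketch under assumptions: the names build_suffix_oracle and bsom_search are hypothetical, positions are 0-indexed, and last is initialised to m so that the window is shifted by its full length when no terminal state is met.

```python
def build_suffix_oracle(p):
    """Sequential construction plus terminal marking along the suffix path."""
    trans, supply = [{}], [-1]
    for m, sigma in enumerate(p):
        trans.append({})
        trans[m][sigma] = m + 1
        k = supply[m]
        while k > -1 and sigma not in trans[k]:
            trans[k][sigma] = m + 1
            k = supply[k]
        supply.append(0 if k == -1 else trans[k][sigma])
    terminals, k = set(), len(p)
    while k != -1:
        terminals.add(k)
        k = supply[k]
    return trans, terminals


def bsom_search(p, t):
    """Return the 0-indexed start positions of the occurrences of p in t."""
    m, n = len(p), len(t)
    if m == 0 or m > n:
        return []
    trans, terminals = build_suffix_oracle(p[::-1])
    occurrences = []
    pos = 0
    while pos <= n - m:
        state, j, last = 0, m, m
        while j > 0:
            state = trans[state].get(t[pos + j - 1])
            if state is None:
                break
            j -= 1
            # A terminal state before the whole window is read: the window
            # suffix starting at offset j may be a prefix of p.
            if j > 0 and state in terminals:
                last = j
        if state is not None:
            occurrences.append(pos)
        pos += last                    # last is always between 1 and m
    return occurrences
```

Because terminal states of the suffix oracle may be reached by strings that are not suffixes, last can be smaller than strictly necessary, which only shortens a shift and never skips an occurrence.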



4.2 A Linear Worst Case Algorithm 



Even if the preceding algorithms are very efficient in practice, they have a
worst-case complexity in O(mn). There are several techniques to make the BDM
algorithm (using the suffix automaton) linear in the worst case, and one of them
can also be used to make our algorithms linear in the worst case. It uses the
Knuth-Morris-Pratt (KMP) algorithm to make a forward reading of some characters
in the text.

To explain the combined use of KMP and (factor or suffix) oracle, we consider 
the current position before the search with the oracle: a prefix v of the string 
has already been read with KMP at the beginning of the search window and we 
start the backward search using the oracle from the right end of that current 
window. The end position of v in the current window is called critical position 
and is denoted by Critpos. 

The current position is schematized in Figure 11.

Fig. 11. Current position in the linear algorithm using both KMP and (factor
or suffix) oracle.

We use the search with the oracle from right to left from the right end of the
window. We consider two cases, depending on whether the critical position is
reached or not.

1. The critical position is not reached. The failure of the recognition of a factor
occurs on character σ as in the general approach (Figure 7). We shift the
window to the left until its beginning goes past character σ. We restart a
KMP search on this new window, rereading the characters already read by the
oracle. This search stops in a new current position (with a new corresponding
critical position) when the recognized prefix is small enough (less than αm
with 0 < α < 1). The value of α is discussed with the experimental results
(see Section 4.3); typically α = 1/2.

This situation is schematized in Figure 12.



[Figure 12 panels: Window; Window shift; Search by KMP algorithm; End of the
search by KMP (v', Critpos'); Back to the current position.]

Fig. 12. First case: the critical position is not reached.



2. The critical position is reached. We resume the KMP search from the
critical position, from the state we were in before stopping, rereading at least
the characters read by the oracle. We then go on reading the text until the
longest recognized prefix is small enough (less than αm). This situation is
schematized in Figure 13.

This algorithm can be used with a backward search done with the factor ora- 
cle as well as with the suffix oracle (saving the last terminal state encountered) . 
We call these two algorithms Turbo-BOM and Turbo-BSOM. Concerning 
the complexity in the worst case, we have the following result. 

Theorem 3. The two algorithms Turbo-BOM and Turbo-BSOM are: (i) linear
considering the number of inspections of characters in the text, the number of
these inspections is less than 2n; (ii) linear considering the number of
comparisons of characters, the number of these comparisons is less than 2n when
the transitions of the oracle are available in O(1) and less than 2n + n log |Σ|
when the transitions are available in log |Σ|.

Fig. 13. Second case: the critical position is reached.



Proof.

(i) Each character in the text is read at most twice, once during the backward
search using the oracle and once during the search with the KMP algorithm. The
result follows.

(ii) This complexity comes directly from the fact that the number of
comparisons done by the KMP algorithm is less than 2n. If the transitions of the
oracle are available in O(1), the backward search using the oracle does not
require any comparisons (it only requires inspections), and the total number
of comparisons is bounded by the number of comparisons in KMP: 2n. If
the transitions are available in log |Σ| (for instance with binary search trees),
the number of comparisons done during the backward search is bounded by
n log |Σ| and the total number by 2n + n log |Σ|.

□ 

4.3 Experimental Results 

In this section, we present experimental results on the time complexity of our
string matching algorithms, compared to the following algorithms:

- Sunday: the Sunday algorithm, often considered as the fastest in practice;
- BM: the Boyer-Moore algorithm;
- BDM: the classical Backward Dawg Matching with a suffix automaton;
- Suff: the Backward Dawg Matching with a suffix automaton but without testing
terminal states, which is equivalent to the basic approach with the factor
automaton;
- BOM: the Backward Oracle Matching with the factor oracle;
- BSOM: the Backward Oracle Matching with the suffix oracle (testing terminal
states);
- Turbo-BOM: the linear algorithm using BOM and KMP with α = 1/2.

Footnote (on Suff): The suffix automaton without taking into account the terminal
states (i.e. considering every state as terminal) and the factor automaton
recognize the same language. The difference is that the factor automaton is
minimal, so its size is smaller than or equal to the size of the suffix automaton.
But the difference in size is not significant in practice, and in any case not
significant enough to justify the implementation of a factor automaton.

Our string matching experiments are done on DNA sequences (we took the
Archaeoglobus fulgidus sequence of 2 MB) and on natural language (English;
we took a compilation of Wall Street Journal articles of 10 MB). We also
performed experiments on random texts for alphabets of size 2, 4, 16 and 32.
Results are obtained with an accuracy of +/- 2% with a confidence of 95% (which
may require thousands of iterations). The machine used is a PC with a Pentium II
processor at 350 MHz running the Linux 2.0.32 operating system. For all the
algorithms, the transitions of the automata are implemented as tables which allow
O(1) branches.

Experimental results in string matching are always surprising because the codes
are small and the time taken by a character comparison is not much greater
than the time taken by an integer incrementation. This is for instance the reason
why the Sunday algorithm is the fastest algorithm for small strings: a window
shift is usually very small but requires very few operations. It is also the
reason why BDM is slower than Suff and BSOM slower than BOM, whereas the window
shifts in BSOM and BDM are greater. When searching in sequences of characters, it
is obviously useless to mark and test terminal states in both the suffix automaton
and the factor oracle.

The four sub-figures of the accompanying figure show that BOM is as fast as Suff
(except on a binary alphabet), which is much more complicated and requires much
more memory. BOM reads more text characters than Suff, but as the oracle
automaton is much smaller than the suffix automaton, it causes fewer memory page
breaks and the experimental search times are the same.

The Turbo-BOM algorithm is the slowest, but it is the only one that can be used
in real time, and in that case its behavior is rather good. Note
that we arbitrarily set the value of α to 1/2. However, according to the tests we
performed for different values of α, the value α = 1/2 is most often
the best one, and the variations of search times with other values of α (as
long as they stay between $(2\log_{|\Sigma|} m)/m$ and $(m - 2\log_{|\Sigma|} m)/m$)
are not very significant and do not by themselves deserve a detailed study.

5 Conclusions 

The two new structures we presented, the factor oracle and the suffix oracle,
enable new string matching algorithms. These algorithms are very efficient in
practice, as efficient as the existing ones, but are far simpler
to implement and require less memory. Based on the experimental results, we
conjecture that they are optimal on average (under a model of equiprobability
of letters), but this remains to be shown.

Many questions about the structure of the factor oracle itself remain open. Among
others, it would be interesting to have a characterization of the language
recognized by the oracle. It would also be of interest to study the average number
of external transitions in the oracle, to know the average memory space required
by the string matching algorithms.

automaton which will complicate and slow the preprocessing phase of the string
matching algorithm.

Fig. 15. Experimental results in time of the string matching algorithms on DNA
sequences (left plot, 2 MB) and natural language (right plot, 10 MB). The X-axis
represents the length of the string and the Y-axis the search time in 1/100th
of a second per MByte.

We notice that the factor oracle is not minimal with respect to the number of
transitions among the automata of m + 1 states which recognize at least the
factors. Such a reduced automaton may also be used in string matching provided
that its construction can be done in linear time. This construction remains an
open problem.

Finally, the factor oracle can be extended to a set of strings and integrated into
multi-string matching algorithms. The experimental results are very promising,
the new algorithms being by far the fastest in many practical cases.

(a) Factor oracle. (b) Reduced automaton.

Fig. 16. The factor oracle is not minimal with respect to the number of transitions
among the automata of m + 1 states which recognize at least the factors.

Abstract. The q-gram filter is a popular filtering method for approximate
string matching. It compares substrings of length q (the q-grams)
in the pattern and the text to identify the text areas that might contain a
match. A generalization of the method is to use gapped q-grams, subsets
of q characters in some fixed non-contiguous shape, instead of contiguous
substrings. Although mentioned a few times in the literature, this
generalization has never been studied in any depth. In this paper, we report
the first results from a study on gapped q-grams. We show that gapped
q-grams can provide orders of magnitude faster and/or more efficient
filtering than contiguous q-grams. The performance, however, depends
on the shape of the q-grams. The best shapes are rare and often possess
no apparent regularity. We show how to recognize good shapes and
demonstrate with experiments their advantage over both contiguous and
average shapes. We concentrate here on the k mismatches problem, but
also outline an approach for extending the results to the more common
k differences problem.



1 Introduction 

Given a pattern string P of length m, a text string T of length n, and a
distance k, the approximate string matching problem is to find all substrings of
the text T that are within a distance k of the pattern P. The most commonly
used distance measure, leading to the k differences problem, is the Levenshtein
distance, the minimum number of single character insertions, deletions and
replacements needed to change one string into the other. A simpler variation, the
k mismatches problem, uses the Hamming distance, the minimum number of
replacements needed to change one string into the other, i.e., the number of
mismatching characters.

The fastest algorithm in practice for the k differences problem is the
bit-parallel dynamic programming algorithm of Myers. It works in time O(nm/w),
where w is the word size of the machine. An extensive survey and comparison of
algorithms is available in the literature. The k mismatches problem is a simpler
problem, but we do not know of any asymptotically faster algorithms for it. As
the pattern is usually short in comparison to the text, much faster string
matching is often possible if the text has been preprocessed to build, e.g., the
suffix tree of the text. This is called indexed or offline string matching. For
the k differences and the k mismatches problems, dynamic programming over the
suffix tree can be fast, but only for short patterns and small distances k.

* Supported by the DFG ‘Initiative Bioinformatik’ grant BIZ 4/1-1.

** Partially supported by the IST Programme of the EU under contract number IST-

Filtering is a way to speed up approximate string matching. The idea is to
narrow down the search to a small fraction of the text with some filtering method
(the filtering phase) and search only those areas using a proper approximate
string matching algorithm (the verification phase). A good filtering method is
fast and efficient, i.e., leaves only a small area to be verified. A good survey of
filtering methods is available in the literature. Among the most popular and
studied filtering methods is the q-gram method.

A q-gram is a substring of length q. The basic q-gram method works as follows.
First, find all matching q-grams between the pattern and the text. That is, find
all pairs (i, j) such that the q-gram at position i in the pattern is identical to the
q-gram at position j in the text. We call such a pair a hit. Second, identify the
text areas that have enough hits. These are the areas passed to the verification
phase. There are different ways of defining the text areas and counting the hits
in them. However, all of them have the same threshold, the
significant number of q-grams. This number is given by the q-gram lemma.

Lemma 1 (The q-Gram Lemma). Let P and S be strings of length m
with (Levenshtein or Hamming) distance k. Then P and S have at least t =
m − q(k + 1) + 1 common q-grams.

The threshold given by the lemma is tight in the sense that using any lower
value might miss an occurrence (see Lemmas 2 and 4). For example, strings
ACAGCTTA and ACACCTTA have Hamming and Levenshtein distance 1 and have
8 − 3(1 + 1) + 1 = 3 common 3-grams: ACA, CTT and TTA.
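
The example can be checked mechanically. The following Python sketch is only an illustration under our own assumptions: the function names (`qgrams`, `common_qgrams`, `qgram_threshold`, `hits`) are hypothetical, and common q-grams are counted position by position, which corresponds to the Hamming case.

```python
def qgrams(s, q):
    """Return the contiguous q-grams of s in order of position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def common_qgrams(p, s, q):
    """Count positions at which two equal-length strings share a q-gram."""
    assert len(p) == len(s)
    return sum(a == b for a, b in zip(qgrams(p, q), qgrams(s, q)))


def qgram_threshold(m, q, k):
    """Threshold t = m - q(k + 1) + 1 of the q-gram lemma (Lemma 1)."""
    return m - q * (k + 1) + 1


def hits(pattern, text, q):
    """All hits (i, j), 1-based: q-gram i of the pattern equals q-gram j of the text."""
    index = {}
    for j, g in enumerate(qgrams(text, q), start=1):
        index.setdefault(g, []).append(j)
    return [(i, j)
            for i, g in enumerate(qgrams(pattern, q), start=1)
            for j in index.get(g, [])]


P, S = "ACAGCTTA", "ACACCTTA"
print(common_qgrams(P, S, 3))          # 3  (ACA, CTT, TTA)
print(qgram_threshold(len(P), 3, 1))   # 3
print(hits(P, S, 3))                   # [(1, 1), (5, 5), (6, 6)]
```

Here the number of common 3-grams meets the threshold exactly, which is the tightness the lemma refers to.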

The above description of the q-gram method leaves many details open. Different
realizations of the method are described in the literature. There are also many
variations, e.g., not using all q-grams.

The q-gram method is particularly suitable for indexed string matching. An 
index of all text q-grams is simple to implement using table lookup, hashing 
or a trie. This makes the q-gram method very fast unless the number of hits 
is large. Thus, we would like q to be large since the number of hits decreases 
exponentially as q increases. On the other hand, as q increases the threshold 
given by the q-gram lemma decreases, which reduces filtering efficiency. The 
best trade-off depends on the implementation and the application. 

A generalization of the q-gram method uses gapped q-grams, subsets of q
characters of a fixed non-contiguous shape. For example, the 3-grams of shape
##-# in the string ACAGCT are AC.G, CA.C and AG.T. Gapped q-grams have been
used before. In some of this work, the motivation is to increase the filtration
efficiency by considering multiple shapes. Pevzner and Waterman use q-grams
containing every (k + 1)st character together with contiguous q-grams for the
k mismatches problem. The FLASH algorithm of Califano and Rigoutsos uses as many
as 40 different random shapes in a probabilistic manner, i.e., without a guarantee
of finding all occurrences. Their approach is effective for high k but they need
a huge index (18 GB for a 100 million nucleotide DNA database). The Grampse
system of Lehtinen et al. uses a shape containing every hth character for
some h (similar to Pevzner and Waterman) for exact matching. Their motivation
for using gapped q-grams is to reduce dependencies between the characters of a
q-gram.

In this paper, we will show that gapped q-grams have advantages over
contiguous q-grams even when using just one shape and not being concerned with
dependencies between characters. Gapped q-grams of suitably chosen shape
provide much faster and/or more efficient filtering. We have observed improvements
of several orders of magnitude in our experiments.

The results in this paper apply only to the k mismatches problem. The k
differences problem causes difficulties for gapped q-grams because they are
affected by insertions and deletions in the gaps. However, the difficulties can be
tackled by using multiple shapes. For example, we can use the shapes #####-##
and #####---## to handle an insertion or a deletion in the gap of the shape
#####--##. We are currently working on this approach to extend our results to
the k differences problem. As a preliminary result in this direction, we show that
even q-grams with just one gap are better than contiguous ones. As the above
example shows, q-grams with few gaps are of special interest for the k differences
problem.

2 Shapes 

A shape Q is a set of non-negative integers containing 0. The size of Q, denoted
by |Q|, is the cardinality of the set. The span of Q is s(Q) = max Q + 1, i.e.,
the size of the minimum contiguous interval containing Q. A shape Q with size
q and span s is called a q-shape or a (q, s)-shape.

For any integer i and shape Q, the positioned shape $Q_i$ is the set
$\{i + j \mid j \in Q\}$. Let $Q_i = \{i_1, i_2, \ldots, i_q\}$, where
$i = i_1 < i_2 < \cdots < i_q$, and let $S = s_1 s_2 \ldots s_m$
be a string. For $1 \le i \le m - s(Q) + 1$, the Q-gram at position i in S,
denoted by $S[Q_i]$, is the string $s_{i_1} s_{i_2} \cdots s_{i_q}$. Two strings
P and S have a common Q-gram at position i if $P[Q_i] = S[Q_i]$.

Example 1. Let Q = {0, 1, 3, 6} be a shape. Using the notation from the
introduction, Q is the shape ##-#--#. Its size is |Q| = 4 and its span is s(Q) = 7.
The string S = ACGGATTAC has three Q-grams: $S[Q_1] = s_1 s_2 s_4 s_7$ = ACGT,
$S[Q_2]$ = CGAA and $S[Q_3]$ = GGTC.
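
The definitions of this section translate directly into code. The sketch below is illustrative only; the helper names `parse_shape`, `span` and `q_grams` are our own and not part of the paper. Strings are indexed from 0 internally, so the list entry at index i − 1 corresponds to $S[Q_i]$.

```python
def parse_shape(picture):
    """Turn a '#'/'-' picture such as '##-#--#' into the sorted offsets of '#'."""
    return sorted(i for i, c in enumerate(picture) if c == "#")


def span(shape):
    """Span s(Q) = max Q + 1."""
    return max(shape) + 1


def q_grams(s, shape):
    """Return the Q-grams S[Q_1], S[Q_2], ... of the string s."""
    return ["".join(s[i + j] for j in shape)
            for i in range(len(s) - span(shape) + 1)]


Q = parse_shape("##-#--#")
print(Q, len(Q), span(Q))                       # [0, 1, 3, 6] 4 7
print(q_grams("ACGGATTAC", Q))                  # ['ACGT', 'CGAA', 'GGTC']
print(q_grams("ACAGCT", parse_shape("##-#")))   # ['ACG', 'CAC', 'AGT']
```

The last line reproduces the gapped 3-grams AC.G, CA.C and AG.T from the introduction, with the gap character omitted.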



3 Threshold 

The q-gram lemma does not apply to gapped q-grams in the form of Lemma 1. A
straightforward generalization would give a threshold of $t = m - s(Q) - |Q|k + 1$
(see Lemma 3 below). Pevzner and Waterman give this threshold for shapes
of the form $\{0, h, 2h, \ldots, (q - 1)h\}$. This threshold for gapped shapes is
strictly worse than for contiguous shapes of the same size. However, the threshold
is not tight for gapped q-grams, as shown by the following example.

Example 2. Let m = 11 and k = 3 and consider the 3-shapes ### and ##-#.
The above thresholds for the two shapes are 0 and −1, respectively. Thus,
neither shape would seem to be useful for filtering in this case. However, the real
threshold for the shape ##-# is 1. By full enumeration of all combinations of
3 mismatches it is possible to verify that at least one ##-#-gram is always
unaffected by the mismatches. The following figure gives an example of a worst
possible combination of 3 mismatches for both shapes.

We will next define the tight threshold for arbitrary q-grams.

Let $P = p_1 \ldots p_m$ and $S = s_1 \ldots s_m$ be two strings of length m. Let
$R(P, S)$ be the set of positions where P and S do not match, i.e.,
$R(P, S) = \{i \in \{1, \ldots, m\} \mid p_i \ne s_i\}$. Then $|R(P, S)|$ is the
Hamming distance of P and S.

To determine the common Q-grams of P and S, only the mismatch set $R(P, S)$
is needed: $P[Q_i] = S[Q_i]$ if and only if $Q_i \cap R(P, S) = \emptyset$. The
minimum number of common q-grams is the threshold value needed for the q-gram
method.

Definition 1. Let m and k be non-negative integers and Q a shape. The
threshold of Q for pattern length m and Hamming distance k is

$$t(Q, m, k) = \min_{R \subseteq \{1, \ldots, m\},\ |R| = k} \left| \{ i \in \{1, \ldots, m - s(Q) + 1\} \mid Q_i \cap R = \emptyset \} \right|.$$

From the above discussion we get the following tight form of the q-gram
lemma for arbitrary shapes.

Lemma 2 (The Q-Gram Lemma). Let Q be a shape. For any two strings P
and S of length m with Hamming distance k, the number of common Q-grams
of P and S, i.e., the size of the set
$\{i \in \{1, \ldots, m - s(Q) + 1\} \mid P[Q_i] = S[Q_i]\}$,
is at least $t(Q, m, k)$. Furthermore, there exist two strings P and S of length m
and Hamming distance k for which the number of common Q-grams is exactly
$t(Q, m, k)$.
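
For small parameters, Definition 1 can be evaluated by the full enumeration mentioned in Example 2. The following Python sketch is a naive brute force under our own naming (`threshold` and `lower_bound` are hypothetical helpers, not the method used for the tables in this paper), and it is exponential in k.

```python
from itertools import combinations


def threshold(shape, m, k):
    """Exact t(Q, m, k): the minimum, over all k-subsets R of {1..m}, of the
    number of positions i whose positioned shape Q_i avoids R."""
    s = max(shape) + 1
    positions = range(1, m - s + 2)          # i = 1 .. m - s(Q) + 1
    best = len(positions)
    for R in combinations(range(1, m + 1), k):
        mismatches = set(R)
        unaffected = sum(
            all(i + j not in mismatches for j in shape) for i in positions)
        best = min(best, unaffected)
    return best


def lower_bound(shape, m, k):
    """The straightforward bound max{0, m - s(Q) - |Q|k + 1}."""
    s = max(shape) + 1
    return max(0, m - s - len(shape) * k + 1)


print(threshold({0, 1, 2}, 11, 3), lower_bound({0, 1, 2}, 11, 3))   # 0 0
print(threshold({0, 1, 3}, 11, 3), lower_bound({0, 1, 3}, 11, 3))   # 1 0
```

The second line reproduces Example 2: for ##-# the exact threshold is 1 although the straightforward bound is not positive.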

^ They also give a better threshold for the case when the span of the shape is close to 
the length of the pattern. 

We have not found a closed form for the exact threshold in the general case.
The following lower bound we already saw in the beginning of this section.

Lemma 3. $t(Q, m, k) \ge \max\{0, m - s(Q) - |Q|k + 1\}$.

Proof. Let R be the set minimizing the expression in the definition of $t(Q, m, k)$.
For each $j \in R$ there are exactly $|Q|$ integers i such that $j \in Q_i$. Therefore,
at most $k|Q|$ of the positioned shapes $Q_i$, $i \in \{1, \ldots, m - s(Q) + 1\}$,
intersect with R, and at least $m - s(Q) - k|Q| + 1$ do not intersect with R. □

Tighter bounds may be given for special cases. In particular, the old q-gram
lemma for contiguous q-grams gives indeed the exact threshold, as shown by the
following lemma.

Lemma 4. Let Q be a contiguous shape, i.e., $Q = \{0, 1, \ldots, q - 1\}$. Then
$t(Q, m, k) = \max\{0, m - s(Q) - |Q|k + 1\} = \max\{0, m - q(k + 1) + 1\}$.

Proof. The lower bound is shown by Lemma 3. Let $R = \{q, 2q, \ldots, kq\}$. Then
$Q_i$ intersects with R if and only if $i \in \{1, \ldots, kq\}$, and thus does not
intersect with R if $i \in \{kq + 1, \ldots, m - q + 1\}$. This shows the upper bound. □
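
The worst-case mismatch set used in the proof of Lemma 4 can be inspected directly. The sketch below is an illustration with hypothetical helper names (`unaffected_positions`, `worst_case_contiguous`); it places the mismatches at q, 2q, ..., kq and lists the positions whose q-gram survives.

```python
def unaffected_positions(shape, m, mismatches):
    """Positions i in 1..m-s(Q)+1 whose positioned shape Q_i avoids the mismatches."""
    s = max(shape) + 1
    return [i for i in range(1, m - s + 2)
            if all(i + j not in mismatches for j in shape)]


def worst_case_contiguous(m, q, k):
    """Surviving positions of the contiguous q-shape for R = {q, 2q, ..., kq}."""
    R = {q * a for a in range(1, k + 1)}
    return unaffected_positions(range(q), m, R)


print(worst_case_contiguous(13, 3, 3))        # [10, 11]
print(worst_case_contiguous(11, 3, 3))        # []
print(len(worst_case_contiguous(50, 5, 4)))   # 26
```

In each case the number of surviving positions equals max{0, m − q(k + 1) + 1}, i.e., 2, 0 and 26, respectively.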

Using the lower bound of Lemma 3 as the threshold guarantees that all
approximate occurrences are found, but it is a very inefficient choice for gapped
shapes. A difference of just one in the threshold value used makes a big difference
in the efficiency of filtering. We have computed the exact thresholds for all shapes
for m = 50 and k ∈ {4, 5}. Tables 1 and 2 give the highest threshold among
(q, s)-shapes for all combinations of q and s that have shapes with positive
thresholds (except q = s = 1).

The tables show that Example 2 was not an isolated case: in many cases,
especially for higher values of q, the best gapped shapes have much higher
thresholds than contiguous shapes of the same or even smaller size. Thus, one can
use a higher value of q to get fewer hits, or have a higher threshold and better
filtration, or even both.

However, it is not sufficient just to have gaps; the shape has to be chosen
carefully. For instance, for the parameters of Example 2, m = 11, k = 3,
q = 3, the shape ##-# and its mirror image #-## are the only ones that have
a positive threshold. As a more impressive example, for the parameter values
m = 50, k = 5 and q = 12, there are only two shapes, ###-#--###-#--###-# and
#-#-# — # #-#-# — # #-#-# — # (and their mirror images), with a positive
threshold. In most cases shown in Tables 1 and 2 only a few shapes achieve
the highest threshold. The distribution of threshold values for one typical case
is shown in Figure 1.
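
For the small parameters of Example 2, the search over all shapes can be sketched as follows. This is an illustrative brute force under our own assumptions, not the program used to produce the tables; `shapes` and `threshold` are hypothetical names. The threshold is computed here as the number of positions minus the largest number of positions that k mismatches can destroy.

```python
from itertools import combinations


def shapes(q, s):
    """All (q, s)-shapes for q >= 2: offsets 0 and s - 1 plus q - 2 inner offsets."""
    return [(0, *inner, s - 1)
            for inner in combinations(range(1, s - 1), q - 2)]


def threshold(shape, m, k):
    """Exact t(Q, m, k) by enumerating all sets of k mismatch positions."""
    n = m - max(shape)                        # number of positions, m - s(Q) + 1
    if n <= 0:
        return 0
    # destroyed[j - 1] = positions i whose Q_i contains text position j
    destroyed = [{j - d for d in shape if 1 <= j - d <= n}
                 for j in range(1, m + 1)]
    worst = max(len(set().union(*R)) for R in combinations(destroyed, k))
    return n - worst


m, k, q = 11, 3, 3
positive = [sh for s in range(q, m + 1) for sh in shapes(q, s)
            if threshold(sh, m, k) > 0]
print(positive)
```

According to the statement above, the printed list should contain only the offset sets (0, 1, 3) and (0, 2, 3), i.e., ##-# and #-##.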

4 Minimum Coverage 

The filtering efficiency of a Q-gram clearly depends on the threshold $t(Q, m, k)$.
However, the correlation is not direct. The following example shows an additional
property of shapes that can affect the filtering efficiency.

Example 3. Let m = 13 and k = 3. Then both shapes ### and ##-# have a
threshold of two. If two strings have four consecutive matching characters, they
have two common 3-grams of shape ###. In contrast, to have two common
3-grams of shape ##-#, two strings need to have at least 5 matching characters.

Motivated by the example, we define the following measure. 

Definition 2. Let Q be a shape and t a non-negative integer. The minimum
coverage of Q for threshold t is

$$c(Q, t) = \min_{C \subseteq \mathbb{N},\ |C| = t} \left| \bigcup_{i \in C} Q_i \right|.$$



The minimum coverage is, in essence, the minimum number of characters
that need to match between a pattern and a text substring for there to be t
matching Q-grams. This gives a reasonable first order estimator for the
probability of random strings having t common Q-grams. We do not analyse this
further in this paper, but our experiments support the conjecture that there is a
strong correlation between the minimum coverage $c(Q, t(Q, m, k))$ and the filter
efficiency (see Figure 2 in Section 5).
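
For small t, the minimum coverage of Definition 2 can be found by brute force. The sketch below is only an illustration; `min_coverage` is a hypothetical name. It assumes, without loss of generality, that the first copy of the shape sits at offset 0 and that consecutive copies start at most s(Q) apart, since moving non-overlapping copies closer together never increases the size of the union.

```python
from itertools import combinations


def min_coverage(shape, t):
    """c(Q, t): the least number of distinct positions covered by t
    differently positioned copies of the shape (brute force)."""
    if t == 0:
        return 0
    s = max(shape) + 1
    best = t * len(shape)                     # t disjoint copies
    for rest in combinations(range(1, (t - 1) * s + 1), t - 1):
        covered = {i + j for i in (0, *rest) for j in shape}
        best = min(best, len(covered))
    return best


print(min_coverage((0, 1, 2), 2))   # 4
print(min_coverage((0, 1, 3), 2))   # 5
```

These are the values of Example 3: two ###-grams can be covered by four matching characters, whereas two ##-#-grams need five.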

We have computed the minimum coverages of all (q, s)-shapes for m = 50
and k ∈ {4, 5}. Tables 1 and 2 give the highest minimum coverage among
(q, s)-shapes for all combinations of q and s that have shapes with positive
thresholds (except q = s = 1). The tables show that considering the minimum
coverage further improves the advantage of the best gapped shapes over the
contiguous shapes. There are even cases where the contiguous shape is the q-shape
with the highest threshold but some gapped q-shapes have a higher minimum
coverage.

As with threshold values, the shapes with the highest minimum coverage are
rare. Figure 1 shows the distribution of minimum coverages in one typical case.




Fig. 1. Distribution of thresholds and minimum coverages of (8, 17)-grams for
m = 50, k = 5. Mirror images and one shape with threshold 0 are not included.

Table 1. The best thresholds/minimum coverages for m = 50 and k = 4.
^a The shapes with the highest threshold and the highest coverage are different.

Table 2. The best thresholds/minimum coverages for m = 50 and k = 5.
^a The shapes with the highest threshold and the highest coverage are different.

5 Experiments

To test the q-grams in practice, we performed some experiments on DNA data.
We used two different databases of size 50 million, one randomly generated (with
even and independent distribution of characters), the other containing the first
50 million basepairs of the GenBank Mouse EST database. The queries we used
were random strings of length 500. However, the threshold used in filtering was
computed for m = 50. The effect is that the filtering is guaranteed to report
all positions where there is an approximate occurrence of a substring of length
50 of the query. The distance k varied between 3 and 6. The experimental
setting corresponds to the high similarity local alignment problems in shotgun
sequencing and EST clustering. No actual matches were found in the
databases, i.e., all potential matches reported by the filtering were false positives.

For a filter algorithm the two main properties of interest are speed and
filtering efficiency. As a measure of these properties we use the number of hits and
the number of matches, respectively. A hit is a pair (i, j) such that the query
q-gram at position i matches the database q-gram at position j. The time to
process the hits usually dominates the running time of the filtering phase. A hit
(i, j) is counted for the position j − i. A match is a position that has at least t
hits. The number of matches reflects the amount of work that the verification
phase must do.
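
The counting of hits and matches described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in our experiments: the names `gram` and `filter_matches` are hypothetical, the index is a plain dictionary, and positions are 0-based (the difference j − i is the same as with 1-based positions).

```python
from collections import Counter, defaultdict


def gram(s, i, shape):
    """The Q-gram of s at (0-based) position i."""
    return "".join(s[i + j] for j in shape)


def filter_matches(query, text, shape, t):
    """Return (hits per position j - i, sorted positions with at least t hits)."""
    s = max(shape) + 1
    index = defaultdict(list)                 # Q-gram -> text positions j
    for j in range(len(text) - s + 1):
        index[gram(text, j, shape)].append(j)
    hits_per_position = Counter()
    for i in range(len(query) - s + 1):
        for j in index.get(gram(query, i, shape), ()):
            hits_per_position[j - i] += 1     # the hit (i, j) counts for j - i
    matches = sorted(d for d, h in hits_per_position.items() if h >= t)
    return hits_per_position, matches


query, text = "ACAGCTTA", "GGACACCTTAGG"
print(filter_matches(query, text, (0, 1, 2), 3))   # (Counter({2: 3}), [2])
print(filter_matches(query, text, (0, 1, 3), 2))   # (Counter({2: 2}), [2])
```

In this toy input the text contains ACACCTTA at offset 2, one mismatch away from the query, so all hits fall on position j − i = 2 for both shapes.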

The expected number of hits decreases exponentially with q. Our conjecture is that
there is a similar dependence between the number of matches and the minimum
coverage of the shape. We tested a large number of shapes using different values
of k and the two databases described above. Figure 2 shows the relation between
the minimum coverage and the number of matches per billion characters.
In most cases, the number of matches was computed from an average over 100
queries, although for some of the shapes, 1000 queries were used. It is clear that
there is a strong but not strict correlation of the form we expected.

For a more detailed comparison of shapes, we chose four classes of shapes for 
different values of q and k: 

- Best. This is the shape with the highest minimum coverage. To choose
  between multiple shapes with the same coverage we used the number of distinct
  covers of the minimum size, the number of distinct covers of the minimum
  size plus one, and the threshold as secondary keys (in this order).

- Median. For each span s, all (q, s)-shapes (without mirror images) were
  ordered by the minimum coverage, and by the same secondary keys as for
  the best shape, and the median shape in this order was identified. Of these
  shapes (one for each s) the best one was used in the experiments. The chance
  that a randomly chosen shape is better than this shape is at most half.

- 1-gap. The best shape with exactly one gap.

- Contiguous.

^ A better filter efficiency could be achieved by counting the hits separately for each
substring of length 50, but we chose the simpler approach. This should not have a
significant effect on the relative performance of different shapes. However, looking
at the results of the experiments one should keep in mind that we do not achieve
maximal filter efficiency.

Fig. 2. Correlation of minimum coverage and filter efficiency (axis label:
minimum coverage).


The chosen shapes except the contiguous ones are shown in the accompanying table.
A missing shape means that the shape in that category had a threshold of 0.
Figure 3 compares the chosen shapes both in theory (q vs. minimum coverage) and in
practice (hits vs. matches). The experimental results are the averages from 1000
queries against the random database. The missing datapoints in the experimental
graph either had a threshold of 0 (bottom of the graph) or had no matches at
all (top of the graph).

There are several things of interest in Figure 3. First, there is a high
correlation between theoretical and experimental behavior. Second, the performance
of the median shapes shows that while a randomly chosen shape is likely to be 
better than a contiguous one, still much better results can be achieved by a care- 
ful choice of the shape. Third, the shapes with one gap, which are of particular 
interest for the k differences problem, are not as good as the best ones overall 
but still much better than contiguous ones. Finally, the best shapes have several 
orders of magnitude better performance than the contiguous shapes. 

6 Concluding Remarks 

We have shown that suitably chosen gapped q-grams can significantly improve
the performance of the basic q-gram filtering for the k mismatches problem.
While interesting in itself, this also opens the door to the possibilities of gapped
q-grams in the numerous other algorithms and applications where contiguous
q-grams and related methods have been found to be useful. In fact, most filtering
methods for string matching use short substrings one way or another. We are
currently working on extending the use of gapped q-grams to the k differences
problem.

Fig. 3. Comparison of four classes of shapes. Panels: q vs. minimum coverage
for k=4; q vs. minimum coverage for k=5; hits vs. filter efficiency for k=4.

There may be applications even beyond string matching. For example, in
DNA sequencing by hybridization (SBH) the problem is, in essence, to construct
a string given its q-grams. Preparata et al. have recently shown that
SBH can be significantly improved by using gapped probes (q-grams) instead of
contiguous ones.

Abstract. Many problems depend on a reliable measure of the distance 
or similarity between objects that, frequently, are represented as vectors. 
We consider here vectors that can be expressed as bit sequences. For 
such problems, the most heavily used measure is the Hamming distance, 
perhaps normalized. The value of Hamming distances is limited by the 
fact that it counts only exact matches, whereas in various applications, 
corresponding bits that are close by, but not exactly matched, can still 
be considered to be almost identical. We here define a “fuzzy Hamming 
distance” that extends the Hamming concept to give partial credit for 
near misses, and suggest a dynamic programming algorithm that permits 
it to be computed efficiently. We envision many uses for such a measure. 



1 Introduction 

The Hamming distance has long been used to quantify the extent to which two
bit sequences, or bitmaps, of the same dimension, differ. An early application
was in the theory of error-correcting codes, where the Hamming
distance measured the error introduced by noise over a channel when a
message is sent between its source and destination. Within an Information Retrieval
environment, bitmaps may indicate the documents a term occurs in; in such
applications, the Hamming distance quantifies differences in the occurrence
patterns of terms.

In a traditional application of the Hamming distance, the only concern is
whether the corresponding bits in two strings agree; the distance does not
distinguish whether mismatched 1-bits in the target and the source are separated
by one or by many positions. Consider the following target bitmap (a), and the
candidate source bitmaps (b) and (c):

(a) 1100100000
(b) 1100010000
(c) 1100000001

As assessed by the traditional Hamming distance, both (b) and (c) are equally 
good efforts to match the target: both differ from it by a Hamming distance of 
2. But intuitively, one is inclined to regard (b) as a better match than (c): while 
both (b) and (c) fail at matching the last 1-bit, (b) misses by only one unit, 
which may be quite acceptable for several applications. If so, we would like, for 
such a distance measure, that (b) be assessed as closer to (a) than is (c). In 
general, there is a great deal of arbitrariness in defining a measure of goodness. 
But a minimal desideratum of a measure of quality is that it at least satisfy such 
intuitive criteria as suggested by the applications for which it is intended. The 
fuzzy Hamming distance we define below introduces this flexibility. 
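
The example bitmaps can be compared with a short Python sketch. It is an illustration only: `hamming` and `one_bits` are hypothetical helper names, and the code computes the traditional (crisp) Hamming distance, not the fuzzy measure defined later. Listing the 1-bit positions makes visible how far each candidate misses the third 1-bit of the target.

```python
def hamming(a, b):
    """Traditional Hamming distance between two equal-length bit strings."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))


def one_bits(bitmap):
    """1-based positions of the 1-bits in a bitmap."""
    return [i for i, bit in enumerate(bitmap, start=1) if bit == "1"]


target, b, c = "1100100000", "1100010000", "1100000001"
print(hamming(target, b), hamming(target, c))        # 2 2
print(one_bits(target), one_bits(b), one_bits(c))    # [1, 2, 5] [1, 2, 6] [1, 2, 10]
```

Both candidates are at Hamming distance 2 from the target, although (b) misplaces the last 1-bit by one position and (c) by five.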

As a possible application, consider the problem of automatically segmenting
a body of text. In this context, it is useful to have a measure of how
well an algorithmic text segmentor agrees with a target partition, as defined,
for example, by a judge. In this case, both the target and the source can be 
represented by bitmaps, with each sentence (or other text unit) being represented 
by a bit position, and 1-bits indicating segment boundaries. Here, ideally, we 
should have an exact matching of 1-bits, and the number of discrepancies is the 
most obvious measure of algorithm failure. However, unlike the many coding 
applications in which the Hamming distance is used, in the text segmentation 
context we have a notion of bit-site proximity, and the fuzzy measure takes this 
into account. 

A further application is term clustering: bitmaps may be used to indicate 
the units (sentences, paragraphs, etc.) within a single document that a given 
term occurs in; in this context, it is troubling that two terms that tend to occur 
“close” to one another, even if not in exactly the same units, are assessed by the 
Hamming distance in the same way as terms that are completely unrelated to 
one another. That is, the original Hamming distance does not recognize the idea 
of neighboring units. 

Furthermore, it is evident that the new measure could be used in image
processing, for example in the compression of black and white images and in
edge detection, with possible applications to vector quantization, computer vision,
robotics and fax transmission.

An example directly parallel to the segmentation problem is measuring the
effectiveness of techniques for parsing Chinese text into words. A more flexible
Hamming distance can also be of use in situations which require assessing the
closeness of pairs of event sequences, in which nearby events are considered to
be associated; such a requirement occurs in some data-mining applications.

It is our intention in this paper to extend the classic Hamming distance to be
sensitive to situations in which the concept of neighboring locations is important.
This extension will be defined in the next section in terms of the operations
necessary to transform the source bitmap into the target bitmap. Conditions
sufficient for this measure to be a true distance will be derived. One important
motivation for constructing the new measure is that we believe it to be more
reliable than the crisp measure when the exact placement of 1-bits is influenced
by random effects. We will test this by comparing the two measures on simulated
sets of bit strings.



2 Fuzzy Hamming Distance 

The traditional definition of a Hamming distance is simple and intuitively ap- 
pealing: given two bit-strings (more generally, strings of symbols) of the same 
dimension, the Hamming distance is the minimum number of symbol changes 
needed to change one bit-map into the other. One can view this as a type of 
edit-distance but with a highly restricted set of edit operations. 

The Fuzzy Hamming distance we are introducing is also a type of edit dis- 
tance. But by extending the edit-operation set, we are able to recognize a notion 
of neighborhood that gives credit for near misses. Unlike most edit distances, 
ours compares pairs of fixed size bitmaps instead of general strings. This sim- 
plifies our task in that it fixes the size of the strings, and we need concentrate 
only on the 1-bits. However, it relies critically on a “shift” operation that while 
important for us, has not gotten much, if any, attention in the string literature. 

Suppose then that we wish to measure the distance between two bitmaps.
For many applications, these enter symmetrically, but this need not be the case.
For example, one bitmap may represent a target bitmap ($B_t$); the other may be
a source bitmap ($B_s$) that is the output of an algorithm that is trying to match
that target. In both cases, we wish to measure how similar the two bitmaps are.
Ideally, $B_s$ and $B_t$ should be identical.

Notationally, we let M denote the dimension of our bitmaps. For example, 
we might have a bit-site for each of M sentences, with a 1-bit indicating that 
the corresponding sentence ends a document segment. We shall use the notation
$N(B)$ to denote the number of 1-bits in the bitmap $B$. We next define
the fuzzy Hamming distance as an edit distance that measures the difficulty of
transforming one such bitmap (say $B_s$) into the other ($B_t$).



2.1 Edit Distance: Overview 

To compute an edit distance we first define a set of elementary edit-operations, 
and associate with each a cost. The edit operations we define must certainly 
include the insert/delete operations of the crisp Hamming distance. In addition,
we introduce a shift operation that allows us to transfer a 1-bit in $B_s$ to a nearby
1-bit in $B_t$ at less cost than deleting the 1-bit in $B_s$ and inserting it in $B_t$. The
shift operation is an abstraction of the concrete task of attempting to match 
a 1-bit in a target and missing, but getting close — it thus captures, for our 
measure, the notion of neighboring bit-sites. 
Given a set of elementary operations, with a cost assigned to each, a measure
of the difference between two bitmaps can be computed by using these operations 
to transform one to the other, and adding up the costs of the operations used. To 
eliminate the ambiguity attached to the multiplicity of possible transformation 
sequences, we quantify how close one bitmap is to another by the minimum cost 
over sequences of elementary operations that transform the source bitmap into 
the target bitmap. 

Computing edit distances typically requires processing whole strings. How- 
ever, the special nature of our problem will allow us to focus on only the 1-bits of 
the bitmap, which greatly reduces the computational cost of our algorithm. Thus, 
to develop our algorithm, it is convenient to describe a participating bitmap by 
a list of the index values, in ascending order, of its 1-bits. For example, we might
represent the bitmap $B_s$ by $S = (s_1, s_2, \ldots, s_{N(S)})$, where $s_i$ is an integer
denoting the position of the $i$-th 1-bit of the bitmap $B_s$, and $N(S) = N(B_s)$. Below
we shall refer to both representations, $B_s$ and $S$, as “the source bitmap”.
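
The change of representation described above, from a bitmap to the ascending list of positions of its 1-bits, might be sketched as follows. The helper name `one_bit_positions` and the use of 1-based positions are assumptions made for illustration; the paper does not fix an indexing convention.

```python
def one_bit_positions(bitmap: str) -> list[int]:
    """Return the 1-based positions of the 1-bits, in ascending order."""
    return [i for i, bit in enumerate(bitmap, start=1) if bit == "1"]


if __name__ == "__main__":
    source = one_bit_positions("0100101")
    print(source)       # [2, 5, 7]
    print(len(source))  # 3, the number of 1-bits in the bitmap
```

For a sparse bitmap the list is much shorter than the bitmap itself, which is what keeps the later computation small.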



2.2 Edit Distance: Details 

In general, let $c(i,j)$ denote the minimum cost of transforming the first $i$
1-bits of a source bitmap, $S$, into the first $j$ 1-bits of a target bitmap, $T$, where
$1 \le i \le N(S)$ and $1 \le j \le N(T)$. Our objective, then, is to compute the
distance $d(S,T)$ between the bitmaps $S$ and $T$ as $c(N(S), N(T))$, the minimum
cost of transforming the source bitmap $S$ into the target bitmap $T$. We will
represent the function $c$ in a tableau, whose values will be computed by means
of a dynamic programming technique, as is standard in the string processing
literature. The dynamic programming technique further assures that we
satisfy the desired constraint that any two shift operations do not cross: i.e., if $s_i$
is shifted to $t_j$ and $s_{i'}$ is shifted to $t_{j'}$, then the dynamic programming technique
assures that if $i < i'$, then $j < j'$.

To define our edit distance, we use the following elementary operations, mo- 
tivated by reference to a problem in which a source bitmap is created with the 
intention of matching a target bitmap: 

- Insertion: Here the algorithm generating the source is considered to have
missed the $j$-th 1-bit of the target. To correct, a 1-bit is inserted into the
source at location $t_j$, incurring a cost $c_I > 0$. If this operator is applied in
an optimal sequence of operations taking $(s_1, \ldots, s_i)$ to $(t_1, \ldots, t_j)$, then

$$c(i,j) = c_I + c(i,j-1),$$

and the $j$-th target bit is considered disposed of. Note that this insertion
doesn’t preclude the possibility of a 1-bit already being present in $B_s$, as
might be the case if $i > j$. But permitting this greatly simplifies our evaluating
this measure. Also note that in our problem, we are changing 1-bits. But
the lengths of the bitmaps are fixed; only the number of 1-bits is changed.
Thus when defined as operations on bitmaps, our terminology is somewhat
non-standard, with insertion denoting a change of bitmap value rather than
an operation that increases the size of the bitmap. Similar considerations
apply below, when we define deletions.

- Deletion: Here the $i$-th 1-bit of the source is assessed as spurious. The 1-bit
is changed to a 0-bit, incurring a cost $c_D > 0$. If this is optimal, then

$$c(i,j) = c_D + c(i-1,j),$$

and the $i$-th source bit is considered disposed of.

- Shift: Here the $j$-th 1-bit of the target and the $i$-th 1-bit of the source are
considered to represent the same bit value, but misaligned by a small amount.
That is, the source generator correctly sensed the need for a 1-bit, but its
exact location may have been in error. Now the source 1-bit is shifted $\Delta$
locations to align it with the target 1-bit; here, $\Delta$ is a non-negative value,
insensitive to the direction of the shift. The cost incurred by this operation
is given by $c_S(\Delta)$, a non-negative function, monotonically increasing with
$\Delta$. If the match was accurate, then $\Delta = 0$, with $c_S(0) = 0$, will denote the
null operation. If a shift operation is optimal, then, for $\Delta = |t_j - s_i|$,

$$c(i,j) = c_S(\Delta) + c(i-1,j-1),$$

and the $i$-th source bit and $j$-th target bit are considered disposed of. Most
simply, we can assess $c_S(\Delta) = \lambda\Delta$, for some non-negative constant $\lambda$. We
would want to adjust $\lambda$ relative to $c_I$ and $c_D$ so that for $\Delta$ “large,” it is
cheaper to delete and insert rather than shift, while the opposite is true for
small $\Delta$. Although the shift operation is unconventional in the string processing
literature, it is easily accommodated within the dynamic programming
framework.

To implement the dynamic programming method, we initialize the tableau 
by inserting the following boundary values: 

$$c(0,j) = j\,c_I \quad\text{and}\quad c(i,0) = i\,c_D.$$

The optimal costs are then developed recursively as indicated in the definition
of the operations. Note that the size of the tableau is $N(S)N(T)$, rather than
$M^2$, which may dramatically reduce the complexity of the computation. Further
efficiencies are possible if the bitmaps are sparse or the maximal shift distance
is sufficiently small.
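
The tableau computation described in this section can be sketched as below. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `fuzzy_hamming` and its parameter names are invented here, the shift cost is taken to be the linear form (a constant times the shift length), and the shift length is read as the difference between the positions of the two 1-bits being aligned.

```python
def fuzzy_hamming(source, target, c_ins=1.0, c_del=1.0, lam=0.15):
    """Minimum edit cost between two ascending lists of 1-bit positions.

    c_ins and c_del are the insertion and deletion costs; lam is the
    per-location cost of a shift (assumed linear).
    """
    def shift_cost(delta):
        # A zero-length shift is the null operation and is always free,
        # even when lam is infinite (avoids inf * 0).
        return 0.0 if delta == 0 else lam * delta

    n, m = len(source), len(target)
    c = [[0.0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):          # boundary: only insertions
        c[0][j] = j * c_ins
    for i in range(1, n + 1):          # boundary: only deletions
        c[i][0] = i * c_del
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            delta = abs(target[j - 1] - source[i - 1])
            c[i][j] = min(
                c_ins + c[i][j - 1],                  # insert j-th target bit
                c_del + c[i - 1][j],                  # delete i-th source bit
                shift_cost(delta) + c[i - 1][j - 1],  # shift source onto target
            )
    return c[n][m]


if __name__ == "__main__":
    # 1-bits at positions 2 and 5 versus 3 and 5: one shift by one location.
    print(fuzzy_hamming([2, 5], [3, 5]))
    # With an infinite shift cost only exact matches are free.
    print(fuzzy_hamming([2, 5], [3, 5], lam=float("inf")))
```

The table has one row per source 1-bit and one column per target 1-bit, so its size depends on the number of 1-bits and not on the bitmap dimension. Passing an infinite `lam` disables every non-trivial shift, leaving only insertions, deletions, and exact matches.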

Below, when we wish to reveal all parameters in the edit distance between
two arbitrary source and target bitmaps of the same dimension, represented as
above by $S$ and $T$, we adopt the notation $d(S, T; c_I, c_D, \lambda)$.

3 Properties 

The fuzzy Hamming distance has some very interesting properties and relations 
to other functions. We first examine the conditions under which the fuzzy Ham- 
ming distance can indeed be shown to be a distance. To do this it is convenient 
to define concavity for a function on the integers as follows: given integers $r, s, t, u$,
with $r < s \le t < u$, a function $f$ defined on the integers is concave provided

$$\frac{f(u) - f(t)}{u - t} \le \frac{f(s) - f(r)}{s - r}.$$
Theorem. The fuzzy Hamming distance, $d(S,T)$, is a true distance function,
if, when $x \ge 0$ denotes the absolute size of a shift:

(a) $c_S(x) \ge 0$, taking the value 0 if and only if $x = 0$;

(b) $c_S(x)$ increases monotonically;

(c) $c_S(x)$ is concave on the integers; and

(d) $c_D = c_I > 0$.

Proof. For the fuzzy Hamming distance to be a metric, three conditions must be 
satisfied. 

Positivity: $d(S,T) \ge 0$. Clearly, if $S = T$, no operations (technically, only
shifts of length zero) are required to transform one to the other, so for this case
$d(S,T) = 0$. On the other hand, if the bitmaps are not identical, at least one
non-trivial operation must be applied, incurring a positive cost. Thus, $d(S,T) \ge 0$,
taking the value zero only for identical bitmaps.

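Symmetry: $d(S,T) = d(T,S)$. For any sequence of operations, $o_1, o_2, \ldots, o_n$,
taking $S$ to $T$, a complementary sequence of operations, $\{o'_i\}$, can be defined,
where

$$o'_i = \begin{cases}
\text{delete} & \text{if } o_i = \text{insert} \\
\text{insert} & \text{if } o_i = \text{delete} \\
\text{shift}(-j) & \text{if } o_i = \text{shift}(j).
\end{cases}$$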

Clearly, the complementary sequence systematically undoes the effect of the 
original sequence, and acting on T, transforms it to S. Under the conditions of 
the theorem, it incurs the identical cost. Since this is true as well for an optimal 
sequence, we see the symmetry condition is satisfied. 

Triangle Inequality: For bitmaps $S$, $T$, and $U$, $d(S,T) \le d(S,U) + d(U,T)$.
Consider an optimal sequence of operations, $t_1$, that transforms $S$ to $T$. Each
operation either processes a 1-bit of S (by deleting it); processes a 1-bit of T 
(by inserting it); or both (by shifting the bit from S onto the bit in T). It is 
convenient for our proof to include a null operator (a shift of zero distance), so 
that below, a shift denotes a non-trivial movement. 

To prove the triangle inequality, the costs of operations that transform $S$ to
$T$ via the intermediary $U$ must be evaluated, and compared to the cost of $t_1$. So
consider the sequence of operations, $t_2$, first taking $S$ to $U$; then the sequence,
$t_3$, taking $U$ to $T$.

Consider, then, a 1-bit in S. It could be considered disposed of in any of the 
following cases: if it is 
i) deleted in $t_2$;

ii) left alone in both $t_2$ and $t_3$;

iii) left alone in $t_2$ and deleted in $t_3$;

iv) left alone in $t_2$ and shifted in $t_3$;

v) shifted in $t_2$ and left alone in $t_3$;

vi) deleted in $t_2$, then reinserted in $t_3$;

vii) shifted in $t_2$ then deleted in $t_3$; or

viii) shifted in both $t_2$ and $t_3$.

A similar breakdown is possible to describe how a 1-bit in T could be dis- 
posed of. In each case, the set of operations similar to i)-v) in effect describes 
operations legitimate in transforming $S$ to $T$ directly. But since $t_1$ is optimal,
these operations, considered as a description of a transformation of S to T, col- 
lectively can only increase the cost or leave it unchanged; this is consistent with 
the triangle inequality. 

Any disposition involving two operations that cancel, as in vi), or a single 
shift combined with a non-null operation, as in vii), can only increase the cost 
relative to the disposition of the bit in $t_1$, again in accordance with the triangle
inequality. By systematically examining all possibilities in this manner, it is 
straightforward to conclude that the only way the triangle inequality can break 
down is if the concatenated sequence $t_2 t_3$ shifts a bit twice: in effect shifting
the bit from its initial location in S to its final destination in T by means of two 
shifts (a disposition illegal when directly transforming S to T). 

Thus, to prove the triangle inequality we need to examine only the case
when two successive shifts, say of lengths $A$ and $B$, are encountered. Consider
two cases: (a) both shifts are in the same direction, resulting in the combined
shift length $A+B$, and (b) one shift, say the first, requires a reverse shift for the
second, resulting in the combined shift length $A-B$. These conditions imply that
for the triangle inequality to be valid, the following inequalities must hold for
$A, B \ge 0$ and $A \ge B$:

$$c_S(A) + c_S(B) \ge c_S(A+B)$$
$$c_S(A) + c_S(B) \ge c_S(A-B).$$

However, the second inequality is a trivial consequence of the monotonicity
property, since $c_S(A)$ alone is at least $c_S(A-B)$, and gives nothing new. So we
need focus only on the consequences of the first inequality.

First note that the triangle inequality is trivially valid if one term, say $B$, is
zero. So consider two integers, $A \ge B > 0$. If $c_S(\cdot)$ is concave on the integers,
then certainly,

$$\frac{c_S(A+B) - c_S(A)}{B} \le \frac{c_S(B) - c_S(0)}{B} = \frac{c_S(B)}{B}.$$

Thus $c_S(A+B) \le c_S(A) + c_S(B)$. That is, the concavity property is sufficient
to assure the triangle inequality, as was to be proved. ■

The class of integer concave functions is quite broad. In particular, if $c_S(\cdot)$ is
linear, or the integer restriction of a traditionally concave function defined over
the reals, then we can assume that our measure is a distance, and benefit from
the intuition that this provides.
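
The inequality at the heart of the proof, that the cost of a combined shift never exceeds the sum of the costs of its two parts, can be spot-checked numerically for candidate cost functions. The sketch below is only a finite check over a range of shift lengths, not a proof; the helper name `is_subadditive` and the sample cost functions are illustrative assumptions.

```python
import math


def is_subadditive(cost, max_shift: int) -> bool:
    """Check cost(a + b) <= cost(a) + cost(b) for all 0 <= a, b <= max_shift."""
    tolerance = 1e-12  # guards against floating-point rounding
    return all(
        cost(a + b) <= cost(a) + cost(b) + tolerance
        for a in range(max_shift + 1)
        for b in range(max_shift + 1)
    )


if __name__ == "__main__":
    print(is_subadditive(lambda x: 0.15 * x, 50))  # linear cost
    print(is_subadditive(math.sqrt, 50))           # concave cost
    print(is_subadditive(lambda x: x * x, 50))     # convex cost
```

A linear cost and a square-root cost pass the check, while a quadratic (convex) cost fails already for two shifts of length one, since 4 exceeds 1 + 1.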

The edit distance we defined above is very general, and it is interesting to 
relate it to other distance-measures between bitmaps. 



Hamming Distance: An obvious measure for the distance between two bitmaps 
is the simple Hamming distance: the minimum number of bits that must be 
changed so the source and target agree. The Hamming distance between the 
bitmaps represented by $S$ and $T$ may be expressed as a special case of the edit
distance: $d(S, T; 1, 1, \infty)$. That is, our measure is a genuine generalization of the
traditional Hamming distance.



Recall/Precision Type Measures: In Information Retrieval, one customarily 
uses two measures to evaluate performance. Recall indicates the fraction of all 
relevant documents that appear in a retrieval set, while precision indicates the 
fraction of documents in a retrieval set that are relevant. These measures can be 
adapted to evaluating the distance between a source and target bitmap. To
do this, we define two functions:

$$p(S,T) = \frac{N(S \text{ AND } T)}{N(S)}, \qquad r(S,T) = \frac{N(S \text{ AND } T)}{N(T)},$$

where the AND operator acts on the bitmaps indicated by $S$ and $T$.

Thus $p(S,T)$ is the fraction of the 1-bits of the bitmap under evaluation,
represented by $S$, which indeed match the 1-bits of the target bitmap represented by
$T$ (e.g., the percentage of segment boundaries produced by our algorithm which
are “correct” in the sense that they are also boundaries in the given reference
partition); similarly, $r(S,T)$ is the fraction of the 1-bits of the target bitmap that
have corresponding 1-bits in the source bitmap (e.g., the percentage of segment
boundaries of the reference partition that are detected by our algorithm).

We thus assign, after having fixed $T$, a pair of numbers $(r, p)$ to each partition.
Relative to $T$, we consider a partition represented by $S_1$ to be better than another
partition, represented by $S_2$, if

$$p(S_1, T) > p(S_2, T) \quad\text{and}\quad r(S_1, T) > r(S_2, T).$$



Generally, however, the values r and p tend to be inversely related, and trying to 
raise one usually lowers the other. By changing the parameters in a segmentation 
algorithm, one could get a series of (r, p)-pairs, and produce curves similar to 
the recall/precision curves in Information Retrieval. 

While these measures are interesting because of their relation to the tradition 
of research in information retrieval, in fact, they can be simply expressed in terms 
of our edit distance. Letting $\mathbf{0}$ denote the zero-bitmap, we find

$$p(S,T) = \frac{d(S, \mathbf{0}; 0, 1, \infty) - d(S, T; 0, 1, \infty)}{d(S, \mathbf{0}; 0, 1, \infty)},$$

$$r(S,T) = \frac{d(S, \mathbf{0}; 0, 1, \infty) - d(S, T; 0, 1, \infty)}{d(T, \mathbf{0}; 0, 1, \infty)}.$$



Since recall and precision can be expressed in terms of fuzzy costs, this rep- 
resentation immediately suggests an interesting generalization. By relaxing the 
infinity value in the cost, we can define fuzzy versions of the classic recall and 
precision measures as used in this context. 
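
The crisp precision and recall measures of this section can be computed directly from two bitmaps, as in the sketch below. The function names are illustrative. The second function rests on a reading that is an assumption of this sketch: with free insertions, unit-cost deletions, and no non-trivial shifts, the cost of turning the source into the zero-bitmap is the number of its 1-bits, and the cost of turning it into the target is the number of source 1-bits that the target lacks.

```python
def precision_recall(source: str, target: str) -> tuple[float, float]:
    """Return (p, r) for two equal-length bitmaps given as '0'/'1' strings."""
    if len(source) != len(target):
        raise ValueError("bitmaps must have the same dimension")
    matched = sum(s == "1" and t == "1" for s, t in zip(source, target))
    n_source, n_target = source.count("1"), target.count("1")
    if n_source == 0 or n_target == 0:
        raise ValueError("both bitmaps need at least one 1-bit")
    return matched / n_source, matched / n_target


def precision_from_costs(source: str, target: str) -> float:
    """Precision via the two edit costs, counted directly (assumed reading)."""
    delete_all = source.count("1")  # cost of reducing the source to all zeros
    unmatched = sum(s == "1" and t == "0" for s, t in zip(source, target))
    return (delete_all - unmatched) / delete_all


if __name__ == "__main__":
    p, r = precision_recall("1110", "1000")
    print(p, r)
    print(precision_from_costs("1110", "1000"))
```

Here one of the three source 1-bits is correct and the single target 1-bit is found, so precision is one third and recall is one; the cost-based form gives the same precision.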



4 Preliminary Tests on Text Collections 

Our primary motivation is to develop a distance measure that is suitable for 
bitmaps, and also takes into account that bit locations have a proximity property. 
Such a distance would have face validity for many problems, such as those noted 
in our introduction, in which bitmaps whose corresponding one-bit locations 
are nearby should be considered closer than pairs of bitmaps, with the same 
numbers of one-bits, but in which the locations of corresponding one-bits are 
well separated. The measure described above has this quality, and in addition 
reduces to the conventional Hamming distance when our shift cost is made large 
enough. 

But while the fuzzy Hamming distance’s face validity justifies its develop- 
ment, we expect that such a measure would also offer performance advantages 
in situations in which bitmaps are generated in a noisy environment. Specifically, 
we speculate that an important distinction between the Hamming distance and 
our fuzzy generalization will be differences in the response of these measures to 
random influences. One way to test this in a controlled manner is to specifically 
introduce random effects by means of a simulation. We have devised a variety 
of such simulations, but the details are deferred to a follow-up paper. 

In this section we briefly report on some simple tests which empirically assess
the usefulness of the fuzzy Hamming distance. 

The first test considers the segmentation problem mentioned above. The
chosen text consisted of the first hundred sentences of Ernest Wright’s novel Gadsby, in
which the letter E never appears. The text was prepared as a numbered sequence
of sentences, but without any mention of the original paragraph breaks by the
author. Two independent assessors, A and B, were then asked to partition the
text into paragraphs. They had no knowledge of the “true” partition, but were
given as a guideline that the average paragraph consisted of 4 to 5 sentences.

The true partition, as well as those produced by A and B were then trans- 
lated into bitmaps, each consisting of one bit-site per sentence, with a 1-bit 
representing a sentence which starts a new paragraph. Figure 1 shows two pairs 
of such bitmaps, the upper pair corresponding to A and the lower to B; in each 
pair, the upper map is the one corresponding to the original partition and the 
lower map is the one produced by the assessor. 

These bitmap pairs were then presented to a group of 102 evaluators, who
were asked to “grade” the similarity of the maps in each of the pairs. It was
emphasized that the grading should be done on intuitive grounds, and
that there is no correct grade which they ought to match. Their challenge was
rather to try to quantify their overall feeling of how close the bitmaps in each
of the pairs look to them by assigning to each pair a number between 0 and 10,
with 0 standing for a perfect match and 10 denoting that there is no connection
whatsoever between the maps within the pair. Table 1 summarizes the results.

1010000100100000100100101000100000010100000100000011000000100100100100100000100100000000000000100000

1000010000100010000100100100100010000100000100100011001000100010000100100000100100001000010000001010

1010000100100000100100101000100000010100000100000011000000100100100100100000100100000000000000100000

1010001000100000100100000100100100010100000010100010100100000100000100100100100010000000100100010010

Figure 1: Partitions of the first 100 sentences of Gadsby into paragraphs



Table 1: Evaluation of closeness of bitmap pairs

Bitmap pair    Hamming distance    Fuzzy distance    Average grade
A              19                  6.82              3.73
B              22                  6.61              3.55



We see that when closeness is measured by the strict Hamming distance, the 
guess of assessor A was closer to the original than that of assessor B. Neverthe- 
less, most of the evaluators gave a better grade to the latter pair, and so did the 
fuzzy Hamming distance, which has been applied with parameters $c_I = c_D = 1$
and $\lambda = 0.15$.

The second test relates to term clustering. The experiment was run on the 
King James version of the English Bible, which contains 10,644 different terms. 
The Bible consists of M = 929 chapters, each of which was taken as a textual 
unit. However, most of the terms occur only rarely, many in just one or two 
chapters, and these had to be excluded from our clustering tests. On the other 
end of the spectrum, some words are so frequent that they appear in practically 
every chapter; these too are not interesting from the clustering point of view. 
We thus restricted our attention to those terms appearing in $N$ chapters, where
$20 < N < 500$. The number of terms satisfying these constraints was $n = 1462$.
For each of these, a bitmap of M bits was generated, with bit i of map j being 
set to 1 if and only if term j appeared in chapter i of the Bible. 

In preparation for the clustering, the crisp Hamming distance was evaluated
for all the possible $n(n-1)/2$ = 1,067,991 pairs, and arranged in an upper-
triangular matrix; their values ranged from 7 (for the terms Asher and Issachar,
which appeared in 28 and 29 chapters respectively) to 550 (for the terms soul 
and came). 
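
The preparation step described above, one bitmap per term with one bit per chapter followed by a distance for every unordered pair of terms, might look as follows on toy data. The chapter contents, the function names, and the dictionary layout are all invented for illustration; they are not the Bible data or the authors' code.

```python
from itertools import combinations


def term_bitmaps(chapters, terms):
    """Map each term to a list with one bit per chapter (1 if it occurs)."""
    return {
        term: [1 if term in chapter else 0 for chapter in chapters]
        for term in terms
    }


def pairwise_hamming(bitmaps):
    """Crisp Hamming distance for every unordered pair of terms."""
    return {
        (a, b): sum(x != y for x, y in zip(bitmaps[a], bitmaps[b]))
        for a, b in combinations(sorted(bitmaps), 2)
    }


if __name__ == "__main__":
    # Hypothetical chapters, each reduced to its set of terms.
    chapters = [{"gad", "reuben"}, {"gad", "reuben", "lamb"},
                {"lamb", "ram"}, {"ram"}]
    bitmaps = term_bitmaps(chapters, ["gad", "reuben", "lamb", "ram"])
    distances = pairwise_hamming(bitmaps)
    print(len(distances))                # 6 pairs for 4 terms
    print(distances[("gad", "reuben")])  # 0
    print(distances[("gad", "ram")])     # 4
```

Storing each pair once, keyed in sorted order, plays the role of the upper-triangular matrix; a clustering procedure would then consume these distances.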

The clustering was performed by using a traditional agglomerative method 
and the stop condition we chose for the clustering process was to reduce the 
number of clusters by half. Since each iteration decreases this number by one,
we had 731 iterations. The test was then repeated with the fuzzy instead of 
the crisp distance, and some striking differences in the contents of the formed 
clusters were found. 

For example, one of the clusters that emerged in the process using the crisp 
measure was 



(joseph (levi ((gad reuben) simeon))),

where parentheses are used to indicate the hierarchical structure of the cluster. 
Note that all the terms here are names of tribes, and therefore clearly related. 
But there are still several tribe names missing. On the other hand, there was a 
similar, but larger, cluster when the process was based on the fuzzy distance: 

(manasseh ((gad (reuben simeon)) (naphtali 
((asher issachar) zebulun)))). 

A tribes cluster is natural, since the tribe names are often mentioned in the same 
contexts. However, due to the chronological nature of the biblical description, 
several consecutive chapters (chosen as textual units in the test) often share the 
same topic. The fuzzy Hamming distance overcomes such chapter boundaries,
which from the content point of view may be erroneous, and this is why it succeeds
in detecting more hidden connections.

Another noteworthy example is a cluster with terms connected to sacrifices. 
The crisp measure gave: 

(((blemish bullock) (kid lamb)) 

((flour mingled) (atonement ram))), 

and a similar cluster, albeit with different internal structure, was produced by 
the fuzzy distance: 

(((atonement (blemish bullock)) (lamb ram)) 
((ordinance plague) ((flour mingled) (kid unleavened)))). 

What is remarkable here is the appearance of the term unleavened. While most 
of its occurrences in the Bible are connected to the unleavened bread eaten on 
Passover, there is also a secondary connection to sacrifices which are accompa- 
nied by unleavened bread. But the term bread appears in many other contexts
and therefore does not form a cluster with unleavened. Indeed, bread appears
as a singleton in both clustering processes, whether based on the crisp or the fuzzy measure.
Nevertheless, the fuzzy distance was able to detect the secondary connection. 

5 Conclusions 

The Hamming distance is a highly respected measure of the distance between 
equi-dimensional strings. But when the strings have a natural neighborhood 
property, as is typical of the bitmaps used in Information Retrieval, then the 
rigidity of the Hamming distance diminishes its value. We defined here a gener-
alization of the Hamming distance that relaxes the condition for elements in a 
bitmap to match, by assigning partial credit to pairs of bits that don’t exactly 
match, but are nonetheless close. We expect this measure to more accurately 
represent the distance between bitmaps subject to “noise.” In this paper, we 
introduced the measure and explored some of its properties; in particular we 
found conditions for its defining a metric. The resistance of the fuzzy Hamming 
distance to random fluctuation was then compared to that of the classical crisp 
Hamming distance by means of controlled simulations; both the simulations and 
the evaluation measures were guided by a clustering metaphor. 

The fundamental reason for defining the new measure is that, on its face, 
it more clearly captures the quality we are trying to evaluate. Our preliminary 
tests suggest that it may offer performance advantages as well in contexts in 
which the effects of randomness are a concern. 

Abstract. The well-known periodicity lemma of Fine and Wilf states
that if the word $x$ of length $n$ has periods $p, q$ satisfying $p + q - d \le n$,
then $x$ has also period $d$, where $d = \gcd(p, q)$. Here we study the case of
long periods, namely $p + q - d > n$, for which we construct recursively a
sequence of integers $p = p_1 > p_2 > \cdots > p_{j-1} > 2$, such that $x_1$, up to a
certain prefix of $x_1$, has these numbers as periods. We further compute
the maximum alphabet size $|A| = p + q - n$ of $A$ over which a word with
long periods can exist, and compute the subword complexity of $x$ over $A$.



1 Introduction 

We consider words over a finite alphabet, not necessarily binary. 

Definition 1. Let $x$ be a word of length $|x| = n$ having integer periods $p$ and
$q$. Throughout we put $d = \gcd(p, q)$, $k = \lfloor q/p \rfloor$, and assume

$$0 < p < q < n. \tag{1}$$

In particular, we disregard the period of length $n$ of $x$. If

$$p + q - d > n, \tag{2}$$

then $p$ and $q$ are called long periods of $x$. If $p, q, d$ satisfy $p + q - d \le n$, then $p$
and $q$ are called short periods of $x$.

Note that in view of the natural requirement (1), long periods aren’t all that 
long compared to short periods! 

The following Periodicity Lemma of Fine and Wilf applies to words with 
short periods. 
Lemma 1. If $x$ has short periods $p, q$, then $x$ has also period $d$.

Many words have both short and long periods. 

Example 1. The periods $< n = 9$ of $x_1 = 110111011$ are 4, 7, 8. Of these, $(p, q) =
(4, 8)$ are short (Lemma 1 gives no new information for this case); and $(4, 7)$ and
$(7, 8)$ are long.
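
The periods in Example 1, and the split of period pairs into short and long, can be reproduced with a brute-force sketch such as the following. The function names are illustrative, and the quadratic-time period test is chosen for clarity rather than efficiency.

```python
from math import gcd


def periods(word: str) -> list[int]:
    """All periods p with 0 < p < len(word): word[i] == word[i + p] throughout."""
    n = len(word)
    return [
        p for p in range(1, n)
        if all(word[i] == word[i + p] for i in range(n - p))
    ]


def classify_pairs(word: str) -> dict[tuple[int, int], str]:
    """Label each pair of periods p < q as 'long' or 'short'."""
    n = len(word)
    ps = periods(word)
    labels = {}
    for a in range(len(ps)):
        for b in range(a + 1, len(ps)):
            p, q = ps[a], ps[b]
            # Long periods are those with p + q - gcd(p, q) > n.
            labels[(p, q)] = "long" if p + q - gcd(p, q) > n else "short"
    return labels


if __name__ == "__main__":
    print(periods("110111011"))         # [4, 7, 8]
    print(classify_pairs("110111011"))
```

For the word of Example 1 the pair (4, 8) comes out short, and (4, 7) and (7, 8) come out long.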

We wish to describe the structure of a word x in terms of its long periods. 
Part of our discussion involves an inductive argument relating to two types of 
long periods which we call Type I and Type II. We’ll see that if x has Type II 
periods, then it can be described in terms of a prefix of x with long periods, 
and that prefix, possibly, in terms of a still shorter prefix of x with long periods. 
The process continues while we deal with prefixes with Type II periods, and 
terminates when we get a prefix with Type I periods. 

2 Structure of Words with Long Periods 

For a word $x$ of length $n$ with long periods $p < q$, put

$$p_1 = p,\quad q_1 = q,\quad d_1 = d,\quad k_1 = \lfloor q_1/p_1 \rfloor,\quad n_1 = n,\quad x_1 = x. \tag{3}$$

For $i \ge 2$ apply Euclid’s algorithm for creating a simple continued fraction
expansion of $q_1/p_1$, to construct recursively the following items.

$$p_i = q_{i-1} - k_{i-1}p_{i-1},\quad q_i = p_{i-1},\quad d_i = \gcd(p_i, q_i),\quad k_i = \lfloor q_i/p_i \rfloor, \tag{4}$$

$$n_i = n_{i-1} - k_{i-1}p_{i-1},\quad x_i = x_1[1 \ldots n_i].$$

Definition 2. For $i \ge 1$, the long periods $p_i$ and $q_i$ of the word $x_i$ are Type I if

$$n_i \le (k_i + 1)p_i, \text{ i.e., } n_{i+1} \le q_{i+1}, \tag{5}$$

and Type II if

$$n_i > (k_i + 1)p_i, \text{ i.e., } n_{i+1} > q_{i+1}. \tag{6}$$

Example 2. For Example 1, the periods $(p_1, q_1) = (4, 7)$ are Type II, and $(7, 8)$
are Type I.
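
The recursion on the triples of periods and lengths, together with the Type I / Type II test, can be traced with a short sketch. The function name `reduction_steps` is illustrative. One point is an assumption of this sketch: the boundary case in which the length equals $(k+1)p$ is counted as Type I, which matches the paper's later example of the word 11011101 with periods 4 and 7.

```python
def reduction_steps(p: int, q: int, n: int):
    """Follow (p_i, q_i, n_i) until Type I periods are reached.

    Returns a list of tuples (p_i, q_i, n_i, k_i, type), where type is
    "I" or "II". The input is expected to be a pair of long periods.
    """
    steps = []
    while True:
        if not 0 < p < q < n:
            raise ValueError("expected 0 < p < q < n")
        k = q // p
        # Assumed reading: Type I when n <= (k + 1) * p, Type II otherwise.
        kind = "I" if n <= (k + 1) * p else "II"
        steps.append((p, q, n, k, kind))
        if kind == "I":
            return steps
        # One Euclid step: new shorter period, new longer period, new length.
        p, q, n = q - k * p, p, n - k * p


if __name__ == "__main__":
    for step in reduction_steps(4, 7, 9):
        print(step)
    print(reduction_steps(7, 8, 9))
```

For the periods (4, 7) of the word of length 9 this yields a Type II step followed by the Type I triple with periods 3 and 4 and length 5, and the quantity $p_i + q_i - n_i$ equals 2 at both steps. The periods (7, 8) are Type I immediately.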



Lemma 2. (i) We have $d_i = d_1$ for all $i \ge 2$. Let $i \ge 1$. If $p_i, q_i, n_i$ satisfy (1)
and (2), then $p_i \nmid q_i$ ($p_i$ does not divide $q_i$), $p_i \ge 3$, and

$$q_i - p_i + 1 \le k_i p_i \le q_i - 1. \tag{7}$$

(ii) For all $i \ge 1$, $p_i + q_i - n_i$ is a constant.

(iii) Let $i \ge 2$. If $p_{i-1}, q_{i-1}$ are Type II periods of $x_{i-1}$, then $p_i, q_i, n_i$ satisfy
(1) and (2).

(iv) If $p_1, q_1$ are long periods of $x_1$, then there exists a smallest $j \in \mathbb{Z}_{>0}$ for
which $n_j \le (k_j + 1)p_j$, i.e., $p_j, q_j$ are Type I periods of $x_j$.

(v) For this smallest value of $j$ we have $p_{j+1} < n_{j+1} \le q_{j+1}$.
Proof. (i) $d_i = \gcd(p_i, q_i) = \gcd(q_{i-1} - k_{i-1}p_{i-1}, p_{i-1}) = \gcd(p_{i-1}, q_{i-1})$. This
implies $d_i = d_1$. We have (1) and (2) $\Rightarrow$ $q_i < n_i < p_i + q_i - d_i$ $\Rightarrow$ $d_i < p_i$ $\Rightarrow$
$p_i \nmid q_i$. Then (7) follows from the fact that $q_i/p_i$ is not an integer. Also $p_i > 1$.
If $p_i = 2$, then $d_i = 1$, so (1) and (2) imply $q_i < n_i < q_i + 1$, a contradiction.
Hence $p_i \ge 3$.

(ii) By (3), (4), $p_{i+1} + q_{i+1} - n_{i+1} = p_i + q_i - n_i$.

(iii) From (4) and (6), $n_i = n_{i-1} - k_{i-1}p_{i-1} > p_{i-1} = q_i$, which is one part of
(1). The hypothesis of (iii) implies that $p_{i-1}, q_{i-1}$ satisfy (1) and (2), so (7) holds
with $i$ replaced by $i - 1$. Hence $1 \le p_i = q_{i-1} - k_{i-1}p_{i-1} < p_{i-1} = q_i$, which are
the other two parts of (1).

As we just remarked, $p_{i-1}, q_{i-1}$ satisfy (2). Hence we have by (ii) and (i),
$p_{i-1} + q_{i-1} - n_{i-1} = p_i + q_i - n_i > d_{i-1} = d_i$, which is (2).

(iv) By (4), $0 < n_i < n_{i-1}$ and $(k_i + 1)p_i \ge 2$ for all $i$. The well-ordering
principle then implies that $j$ with the desired property exists.

(v) The right side is the right side of (5) and the left side follows from (1),
(4). ■



Lemma 3. Definitions 1 and 2 are consistent. 



Proof. For Type I periods, (2) and (5) do not imply one another in either
direction. For Type II periods, (2) and (6) imply $kp \le q - d - 2$ $\Rightarrow$ $q + 1 - p \le
q - d - 2$ (by (7)) $\Rightarrow$ $d \le p - 3$. Now $d = p - 2$ $\Rightarrow$ $d \mid 2$ $\Rightarrow$ $d \in \{1, 2\}$ $\Rightarrow$ $p \in
\{3, 4\}$. Similarly, $d = p - 1$ $\Rightarrow$ $p = 2$. Also $p = 1$ is excluded since $p \nmid q$. Hence
for a Type II period we need $p \ge 5$ or $p = 4$, $d = 1$. We show now that this
requirement is satisfied “automatically”.

Let there be a word with long periods $p = 4 < q$ and $d = 2$. Then $q = 4t + 2$,
$t \in \mathbb{Z}_{>0}$. By (1), (2), $q < n < q + 2$. Thus $n = q + 1$. Hence $n = 4t + 3 <
(\lfloor q/4 \rfloor + 1)4 = 4t + 4$, so $(p, q)$ are Type I periods. We can see similarly that
$p \in \{2, 3\}$ (whence $d = 1$) implies that $(p, q)$ are Type I periods.

Note that (1) and (6) do not imply one another in either direction.

In conclusion, Definition 1 is consistent with a word with long periods to
be either Type I or Type II; and Type I or Type II words are consistent with
Definition 1. ■



Lemma 4. Let $x_1$ be a word of length $n_1$ with long periods $q_1 > p_1$, $k_1 =
\lfloor q_1/p_1 \rfloor$. Then for $g_1 \in \{0, \ldots, k_1\}$ we have,

$$x_1[i] = x_1[i + g_1 p_1] \tag{8}$$

for $i = 1, \ldots, n_2 = n_1 - k_1 p_1$. Moreover, the prefix $x_2 = x_1[1 \ldots n_2]$ of $x_1$ has
period $p_2 = q_1 - k_1 p_1$.



Proof. Note that (1) and the right hand side of (7) imply the inequalities:

$$n_2 = n_1 - k_1 p_1 \ge 2 \tag{9}$$

and $q_1 - k_1 p_1 \ge 1$. Since $x_1$ has period $p_1$, (8) follows. For verifying the second
statement we have to show: $x_1[i] = x_1[i + p_2]$ for $i = 1, \ldots, n_1 - k_1 p_1 - (q_1 -
k_1 p_1) = n_1 - q_1 \ge 1$ (by (1)). Indeed, since $x_1$ has period $q_1$ we have for $i \in
[1, n_1 - q_1]$,

$$x_1[i] = x_1[i + q_1] = x_1[i + q_1 - k_1 p_1] = x_1[i + p_2],$$

where the second equality follows from the $p_1$-periodicity of $x_1$. ■

An Extension of the Periodicity Lemma to Longer Periods

Corollary 1. (i) If $x_j$ has Type I periods $p_j < q_j$, then $x_j = (x_{j+1}z)^{k_j}x_{j+1}$,
where the border $x_{j+1}$ has period $p_{j+1}$, and $z = x_j[n_{j+1} + 1 \ldots p_j]$.

(ii) If $x_1$ has Type II periods $p_1 < q_1$, then $x_2$ is a word with long periods $p_2$ and
$q_2$ which satisfy (1), namely, $n_2 > q_2 > p_2 > 0$.

Proof. (i) Inequalities (9) and (5) imply $|x_{j+1}| > 0$, $|z| \ge 0$. Evidently $|x_{j+1}z| =
p_j$. The result of the first part of (i) now follows from the structure of the word
(8), which has period $p_j$.

(ii) Follows directly from Lemma 2(iii). ■

Example 3. Let $x_1 = 1101110111$, with $n_1 = 10$. The periods $< 10$ are 4, 8, 9,
where $(4, 8)$ is short; $(4, 9)$, $(8, 9)$ are long, of Type I. Corollary 1(i) for $(4, 9)$
(with $k = 2$) states that $x_2 = 11$, $z = 01$, so $x_1 = (1101)^2 11$, where $x_2$ has
period $p_2 = 1$. Corollary 1(ii) is illustrated, say, by $x_1 = (1102)^2 11$, which has
the same parameters, but a larger alphabet. With respect to $(p_1, q_1) = (8, 9)$,
Corollary 1(ii) implies that we can take $x_1 = (11023456)11$.

We give two more examples for illustrating Corollary 1(ii).

Example 4. $x_1 = 11011101$ ($n_1 = 8$) has Type I periods $p_1 = 4$, $q_1 = 7$, so
$k_1 = 1$, $p_2 = 3$, $n_2 = 4$ and $z$ is empty. By Corollary 1(i), $x_1 = (1101)^1 1101$. By
Corollary 1(ii) also $x_1 = (1201)^1 1201$ has the same parameters.

Example 5. $x_1 = 101101110110$ ($n_1 = 12$) has Type I periods $p_1 = 7$, $q_1 = 10$,
so $k_1 = 1$, $p_2 = 3$, $n_2 = 5$, $x_2 = 10110$ and $z = 11$. By Corollary 1(i), $x_1 =
((10110)(11))^1 10110$. By Corollary 1(ii), $x_1 = ((10210)(34))^1 10210$ has the same
parameters.

Example 6. Referring to Example 2, Corollary l(iii) (for {pi,q\) = (4,7)) states 
that X 2 = 11011 of length zz2 = 5 has long periods (72 = 4 and p 2 = 3, which are, 
in fact. Type I periods. 

Corollary 1(i) and (ii) describe the structure of words with Type I periods.

Corollary 1(iii) suggests an iterative procedure for expressing the structure of $x_1$ in terms of the shorter of a sequence of long periods $p_i$ of its prefixes $x_i$. The recursion terminates when the smallest j is reached for which $n_j < (k_j + 1) p_j$. That is, the recursion terminates when the triple $(p_j, q_j, n_j)$ corresponds for the first time to Type I periods.

Lemma 5. Let $s \in \{1, \ldots, j - 1\}$, and let $j \ge 2$ be fixed. For $g_\ell \in \{0, \ldots, k_\ell\}$ ($1 \le \ell < j$), put $E_s = g_s p_s + \ldots + g_{j-1} p_{j-1}$. Then for $i = 1, \ldots, n_j$, $E_s + i$ assumes all the integer values in the integer interval $[1, n_s]$.



Proof. Descent on s. For $s = j - 1$, $E_{j-1} + i = i + g_{j-1} p_{j-1}$. For fixed values of $g_{j-1} \in \{0, \ldots, k_{j-1}\}$, we let i range over $[1, n_j]$. This produces the intervals $I_0 = [1, n_j]$, $I_1 = [1 + p_{j-1}, n_j + p_{j-1}]$, ..., $I_{k_{j-1}} = [1 + k_{j-1} p_{j-1}, n_j + k_{j-1} p_{j-1}]$.

Since $n_j + k_{j-1} p_{j-1} = n_{j-1}$, the first interval begins with 1 and the last ends with $n_{j-1}$. Moreover, every two consecutive intervals overlap: $I_t$ ends with $n_j + t p_{j-1}$, and $I_{t+1}$ begins with $1 + (t + 1) p_{j-1}$, and we have $1 + (t + 1) p_{j-1} < n_j + t p_{j-1} = n_{j-1} - k_{j-1} p_{j-1} + t p_{j-1}$ by (6). Thus the union of these intervals is $[1, n_{j-1}]$, as required.

Suppose that we have already shown that for $s > 1$, $E_s + i$ assumes all the values in $[1, n_s]$. Now $E_{s-1} + i = i + g_{s-1} p_{s-1} + \ldots + g_{j-1} p_{j-1} = i + g_{s-1} p_{s-1} + E_s$. For fixed values of $g_{s-1} \in \{0, \ldots, k_{s-1}\}$, we let i range again over $[1, n_j]$. This produces the intervals $[1 + E_s, n_j + E_s]$, $[1 + p_{s-1} + E_s, n_j + p_{s-1} + E_s]$, ..., $[1 + k_{s-1} p_{s-1} + E_s, n_j + k_{s-1} p_{s-1} + E_s]$. Let I denote the union of these intervals. By the induction hypothesis,

$I = [1, n_s] \cup [1 + p_{s-1}, n_s + p_{s-1}] \cup \ldots \cup [1 + k_{s-1} p_{s-1}, n_s + k_{s-1} p_{s-1}]$.

Since $n_s + k_{s-1} p_{s-1} = n_{s-1}$, the first of the intervals begins with 1 and the last ends with $n_{s-1}$. Moreover, every two consecutive intervals overlap: $1 + (t + 1) p_{s-1} < n_s + t p_{s-1} = n_{s-1} - k_{s-1} p_{s-1} + t p_{s-1}$ by (6), proving the assertion.



Corollary 2. (i) $n_j + k_1 p_1 + \ldots + k_{j-1} p_{j-1} = n_1$,

(ii) $n_j + k_2 p_2 + \ldots + k_{j-1} p_{j-1} = n_2$,

(iii) For $i \in [1, n_j]$, $g_\ell \in \{0, \ldots, k_\ell\}$ ($1 \le \ell < j$), $E_1 + i = i + g_1 p_1 + \ldots + g_{j-1} p_{j-1}$ assumes all the values in $[1, n_1]$.

Proof. By (3), (4), $n_j + k_1 p_1 + \ldots + k_{j-1} p_{j-1} = n_{j-1} + k_1 p_1 + \ldots + k_{j-2} p_{j-2} = \ldots = n_2 + k_1 p_1 = n_1$, proving the first identity. The second identity is proved similarly. The third part is the case s = 1 of Lemma 5. ■

The following is our main result. 

Theorem 1. Let $x_1$ be a word with long periods $p_1 < q_1$; j as defined in Lemma 2(iii). Then the prefix $x_j = x_1[1 \ldots n_j]$ of $x_1$ is a word with Type I periods $p_j, q_j$ satisfying $n_j > q_j > p_j$. For $i = 1, \ldots, n_j$ we have,

$x_1[i] = x_1[i + g_1 p_1 + \ldots + g_{j-1} p_{j-1}]$, (10)

for all choices of $g_\ell \in \{0, \ldots, k_\ell\}$, $\ell = 1, \ldots, j - 1$.

Proof. Note that $i + g_1 p_1 + \ldots + g_{j-1} p_{j-1} \le n_j + k_1 p_1 + \ldots + k_{j-1} p_{j-1} = n_1$ by Corollary 2(i), so (10) is well-defined. If $x_1$ has Type II periods $q_1 > p_1$, then Corollary 1(iii) implies that $x_2$ has long periods $p_2, q_2$ satisfying $n_2 > q_2 > p_2 > 0$. We proceed by induction on j. For j = 2 (i.e., the periods $p_2, q_2$ of $x_2$ are Type I), (8) is (10) and we are done.

For j > 2 we may apply the induction hypothesis to $x_2$, to conclude that for $i = 1, \ldots, n_j$,

$x_1[i] = x_1[i + g_2 p_2 + \ldots + g_{j-1} p_{j-1}]$. (11)

We have $i + g_2 p_2 + \ldots + g_{j-1} p_{j-1} \le n_j + k_2 p_2 + \ldots + k_{j-1} p_{j-1} = n_2$ by Corollary 2(ii), so (11) is well-defined.

Since $x_1$ has period $p_1$, we can add $g_1 p_1$ to i on both sides of (11), to get:
$x_1[i + g_1 p_1] = x_1[i + g_1 p_1 + g_2 p_2 + \ldots + g_{j-1} p_{j-1}]$.

Relation (10) now follows from this and from (8). ■

Example 1. Let $x_1 = 1101110110111011$ of length $n_1 = 16$. It has periods 7, 11, 14, 15. Note that (7, 11) are Type II. Then $(n_1, p_1, q_1, k_1) = (16, 7, 11, 1)$. We have $(n_2, p_2, q_2, k_2) = (9, 4, 7, 1)$, with (4,7) being Type II periods. Then $(n_3, p_3, q_3, k_3) = (5, 3, 4, 1)$, where (3,4) are Type I. Thus j = 3, so by Theorem 1, $x[i] = x[i + 7 g_1 + 4 g_2]$, $g_1, g_2 \in \{0, 1\}$ for $i = 1, \ldots, 5$.
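As a sanity check, identity (10) can be tested directly on the word of this example. The sketch below assumes only the numbers quoted in the example ($n_3 = 5$, $p_1 = 7$, $p_2 = 4$, $k_1 = k_2 = 1$); the helper name `identity_10_holds` is hypothetical.

```python
from itertools import product


def identity_10_holds(word, n_j, shifts, ks):
    """Check word[i] == word[i + g_1 p_1 + ... ] for all g_l in 0..k_l and i < n_j."""
    for gs in product(*(range(k + 1) for k in ks)):
        offset = sum(g * p for g, p in zip(gs, shifts))
        for i in range(n_j):  # 0-based i stands for the text's i = 1, ..., n_j
            if word[i] != word[i + offset]:
                return False
    return True


x = "1101110110111011"
print(identity_10_holds(x, n_j=5, shifts=(7, 4), ks=(1, 1)))
```

The largest index touched is 4 + 7 + 4 = 15, the last position of the word, in line with Corollary 2(i).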

3 Maximum Subword Complexity 

A word x with long periods p, q can exist over various alphabets A. In this
section we determine, for any given x with long periods, the maximum alphabet 
size |A| such that x exists over A and every letter of A appears in x. An alphabet 
A which is maximum in this sense and every letter of A does appear in x will 
be called a proper alphabet with respect to p, q. We then compute the subword 
complexity of x over a proper alphabet with respect to p, q. 

Note that for different long periods of the same word x we will, in general, 
have a proper alphabet of different size, as well as a different subword complexity. 

First, given $p_j, q_j, n_j$ satisfying (1), (2) and (5), we construct a proper alphabet A and $x_j$ of size $n_j$ with Type I periods $p_j < q_j$ over A. Let $A = \{a_1, \ldots, a_{p_j + q_j - n_j}\}$ be an alphabet, where the $a_i$ are its distinct letters. Put $x_1[1 \ldots p_{j+1}] = a_1 \ldots a_{p_{j+1}}$. By Lemma 2(v), $p_{j+1} < n_{j+1}$. Let $x_1[p_{j+1} + i] = x_1[i]$ for $1 \le i \le n_{j+1} - p_{j+1}$ ($= n_j - q_j$). Then $x_{j+1} = x_1[1 \ldots n_{j+1}]$ is periodic with period $p_{j+1}$, consistent with Corollary 1(i). Set $x_1[n_{j+1} + 1 \ldots p_j] = z = a_{p_{j+1} + 1} \ldots a_{p_j + q_j - n_j}$.

Let $x_j = (x_{j+1} z)^{k_j} x_{j+1}$. Since $|x_{j+1} z| = p_j$, we have $|x_j| = k_j p_j + n_{j+1} = n_j$. Moreover, $x_j$ is periodic with period $p_j$. To show that $x_j$ is a word with Type I periods $p_j < q_j$, it suffices to show that it has period $q_j$. By the $p_{j+1}$-periodicity of $x_{j+1}$, $x_1[i] = x_1[i + p_{j+1}]$ for $i \in [1, n_{j+1} - p_{j+1}] = [1, n_j - q_j]$. Also $x_1[i + p_{j+1}] = x_1[i + q_j - k_j p_j] = x_1[i + q_j]$ by the $p_j$-periodicity of $x_j$. Thus $x_1[i] = x_1[i + q_j]$ for $i \in [1, n_j - q_j]$.

The alphabet A has size $p_j + q_j - n_j = p_1 + q_1 - n_1$ by Lemma 2(ii). In view of the $p_j$-periodicity of $x_j$, A cannot be any larger. It is also proper.

Secondly, given $j \ge 2$, $p_i, q_i, n_i$ for $1 \le i \le j$, where $p_i, q_i$ are Type II periods for $1 \le i < j$ and $p_j, q_j$ are Type I periods, we construct a proper alphabet B and word $x_1$ with Type II periods $p_i < q_i$ ($1 \le i < j$) and Type I periods $p_j, q_j$ over B. Since $x_1$ contains a Type I factor as a prefix by Theorem 1, $|B| \le |A|$. We will see that actually B = A. We apply the procedure (4) to get $p_i, q_i, k_i, n_i, x_i$ for $1 \le i \le j$, where j is as in Lemma 2(iv), and where $p_i < q_i$ are Type II periods for $1 \le i < j$. Construct the first prefix $x_j = x_1[1 \ldots n_j]$ of $x_1$ over A, as above. Longer prefixes are constructed iteratively by descent. Suppose that for some $\ell \in \{2, \ldots, j\}$ we have already constructed $x_\ell = x_1[1 \ldots n_\ell]$. To construct $x_{\ell-1}$, put $x_1[i] = x_1[i - p_{\ell-1}]$ for $i = n_\ell + 1, \ldots, n_{\ell-1}$. Since the prefix $x_\ell$ of $x_{\ell-1}$ has period $q_\ell = p_{\ell-1}$ by hypothesis, the entire factor $x_{\ell-1}$ also has period $p_{\ell-1}$. We now show that it also has period $q_{\ell-1}$.

Let $i \in [1, n_{\ell-1} - q_{\ell-1}]$. Since $x_{\ell-1}$ has period $p_{\ell-1}$, we have $x_1[i + q_{\ell-1}] = x_1[i + q_{\ell-1} - k_{\ell-1} p_{\ell-1}] = x_1[i + p_\ell]$. Now $i + p_\ell \le n_{\ell-1} - q_{\ell-1} + p_\ell = n_\ell$, so $x_1[i + p_\ell]$ lies in $x_\ell = x_1[1 \ldots n_\ell]$. Since $x_\ell$ has period $p_\ell$ by hypothesis, we get $x_1[i + q_{\ell-1}] = x_1[i]$, so $x_{\ell-1}$ has periods $p_{\ell-1} < q_{\ell-1}$ as required.

We have proved, constructively:

Theorem 2. Any word $x_1$ of length $n_1$ with long periods $p_1 < q_1$ can be realized over a proper alphabet A of size $|A| = p_1 + q_1 - n_1$.

The subword complexity of a word $x_1$ with long periods $p_1 < q_1$ over a proper alphabet A will be called maximum subword complexity with respect to $p_1, q_1$. We shall now compute the maximum subword complexity.

Theorem 3. Let $x_1$ of length $n_1$ be a word with Type I periods $p_1, q_1$ over a proper alphabet A with respect to $p_1, q_1$, and let C(m) be its maximum subword complexity function with respect to $p_1, q_1$. Then

$C(m) = p_1 + q_1 - n_1 + m - 1$ for $1 \le m \le n_1 - q_1 + 1$,

$C(m) = p_1$ for $n_1 - q_1 + 1 \le m \le n_1 - p_1 + 1$,

$C(m) = n_1 - m + 1$ for $n_1 - p_1 + 1 \le m \le n_1$,

and $|A| = p_1 + q_1 - n_1$.



Proof. Theorem 2 implies that $|A| = C(1) = p_1 + q_1 - n_1$. We consider three cases.

(I) $1 \le m \le n_1 - q_1$. Induction on m. We have just seen that the result holds for m = 1. Assume it holds for m - 1 ($2 \le m \le n_1 - q_1 + 1$), i.e., $C(m - 1) = p_1 + q_1 - n_1 + m - 2$. Thus there are $p_1 + q_1 - n_1 + m - 2$ distinct factors of length m - 1 in $x_1$. Since $x_1$ has period $p_1$, each of these factors has a copy whose first letter is in $x_1[1 \ldots p_1]$. Each of these copies is a prefix of a factor of length m, so $C(m) \ge p_1 + q_1 - n_1 + m - 2$. If we had $C(m) = C(m - 1)$, then $x_1$ would be periodic with period $p_1 + q_1 - n_1 + m - 2 \le p_1 - 1$, contradicting the assumption of computing the subword complexity for a proper alphabet with respect to $p_1 < q_1$. Therefore $C(m) \ge p_1 + q_1 - n_1 + m - 1$.

We now show that this is actually an equality. We do this by showing that every factor of length m has a copy in one of the $p_1 + q_1 - n_1 + m - 1$ factors which have their last letter in the interval $I = [n_1 - p_1 + 1, q_1 + m - 1]$.

Let $x_1[\ell \ldots \ell + m - 1]$ be any factor of $x_1$ of length m. Let i be the smallest integer $\ge n_1 - p_1 + 1$ such that $x_1[i \ldots i + m - 1] = x_1[\ell \ldots \ell + m - 1]$. Since $x_1$ has period $p_1$, such i does exist. We wish to show that $i \in I$. Suppose not. Then $i \ge q_1 + m$, so $i - q_1 \ge m \ge 2$. By the $q_1$-periodicity of $x_1$ we then have $x_1[i \ldots i + m - 1] = x_1[i - q_1 \ldots i + m - 1 - q_1]$.

By (7), $k_1 p_1 < q_1$, hence $x_1[i \ldots i + m - 1] = x_1[i - q_1 + k_1 p_1 \ldots i + m - 1 - q_1 + k_1 p_1]$. Now $i - (q_1 - k_1 p_1) < i$, and $i - (q_1 - k_1 p_1) \ge q_1 + m - q_1 + k_1 p_1 \ge k_1 p_1 + 2 > n_1 - p_1 + 1$, since $p_1, q_1$ are Type I. Thus $i - (q_1 - k_1 p_1)$ is smaller than i, yet has the desired properties. This contradiction shows that $i \in I$, so $C(m) = p_1 + q_1 - n_1 + m - 1$.

(II) $n_1 - q_1 < m \le n_1 - p_1 + 1$. From the case $m = n_1 - q_1 + 1$ we see that the $p_1$ factors $x_1[1 \ldots n_1 - q_1 + 1], \ldots, x_1[p_1 \ldots n_1 - q_1 + p_1]$ of length $n_1 - q_1 + 1$ are all distinct. Therefore the same holds for the $p_1$ factors $x_1[1 \ldots m], \ldots, x_1[p_1 \ldots p_1 + m - 1]$ for every m satisfying $p_1 + m - 1 \le n_1$. There can be no other factors of this length, since $x_1$ has period $p_1$.

(III) $n_1 - p_1 + 1 < m \le n_1$. As in the previous case, the $n_1 - m + 1$ factors $x_1[1 \ldots m], \ldots, x_1[n_1 - m + 1 \ldots n_1]$ of length m are all distinct. Thus $C(m) = n_1 - m + 1$. ■
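The three-piece formula of Theorem 3 can be compared with a brute-force count of distinct factors. The sketch below uses the word $(1102)^2 11$ from Example 3, which has Type I periods (4, 9), length 10 and three letters; the function names are illustrative, not taken from the paper.

```python
def subword_complexity(word: str, m: int) -> int:
    """Number of distinct factors (contiguous subwords) of length m."""
    return len({word[i:i + m] for i in range(len(word) - m + 1)})


def theorem3_value(n: int, p: int, q: int, m: int) -> int:
    """C(m) as stated in Theorem 3 for Type I periods p < q and length n."""
    if m <= n - q + 1:
        return p + q - n + m - 1
    if m <= n - p + 1:
        return p
    return n - m + 1


x, n, p, q = "1102110211", 10, 4, 9
for m in range(1, n + 1):
    print(m, subword_complexity(x, m), theorem3_value(n, p, q, m))
```

At the two boundary values of m the neighbouring pieces of the formula agree, so the order of the tests in `theorem3_value` does not matter.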
Abstract. In 1995, Hannenhalli and Pevzner gave a first polynomial
solution to the problem of finding the minimum number of reversals
needed to sort a signed permutation. Their solution, as well as subsequent 
ones, relies on many intermediary constructions, such as simulations with 
permutations on 2n elements, and manipulation of various graphs. 

Here we give the first completely elementary treatment of this problem. 
We characterize safe reversals and hurdles working directly on the origi- 
nal signed permutation. Moreover, our presentation leads to polynomial 
algorithms that can be efficiently implemented using bit-wise operations. 



1 Introduction 

In the last ten years, beginning with many papers have been devoted to the 
subject of computing the reversal distance between two permutations. A reversal 
ρ(i, j) transforms a permutation

π = (π_1 ... π_i ... π_j ... π_n)

to π' = (π_1 ... π_j ... π_i ... π_n),

and the reversal distance between two permutations is the minimum number of 
reversals that transform one into the other. 

From a problem of unknown complexity, it graduated to an NP-Hard problem 
B, but an interesting variant was proven to be polynomial Q. In the signed 
version of the problem, each element of the permutation has a plus or minus 
sign, and a reversal ρ(i, j) transforms π to:

π' = (π_1 ... -π_j ... -π_i ... π_n).

Permutations, and their reversals, are useful tools in the comparative study 
of genomes. The genome of a species can be thought of as a set of ordered 
sequences of genes - the ordering devices being the chromosomes -, each gene 
having an orientation given by its location on the DNA double strand. Different 
species often share similar genes that were inherited from common ancestors. 
However, these genes have been shuffled by mutations that modified the content 
of chromosomes, the order of genes within a particular chromosome, and/or
the orientation of a gene. Comparing two sets of similar genes appearing along a
chromosome in two different species yields two (signed) permutations. It is widely 
accepted that the reversal distance between these two permutations faithfully 
reflects the evolutionary distance between the two species. 

Computing the reversal distance of signed permutations is a delicate task 
since some reversals unexpectedly affect deep structures in permutations. In 
1995, Hannenhalli and Pevzner proposed the first polynomial algorithm to solve
it, developing along the way a theory of how and why some permutations
were particularly resistant to sorting by reversals. It is no surprise that the
label fortress was assigned to especially acute cases.

Hannenhalli and Pevzner relied on several intermediate constructions that 
have been simplified since but grasping all the details remains a challenge. 

All the criteria given for choosing a safe reversal involve the construction of an
associated permutation on 2n points, and the analysis of cycles and/or connected
components of graphs associated with this permutation.

In this paper, we present a very elementary treatment of the sorting of the 
oriented components of a permutation, together with an elementary definition of
the concept of hurdle that further simplifies the definition given in Q. Our first 
algorithm is so simple that, for example, sorting a permutation of length 20, by 
hand, should be easy and straightforward. 

The next section presents the basic algorithms. Section 3 contains the necessary
links to the Hannenhalli-Pevzner theory, and the proofs of the claims of
Section 2. Finally, in the last section, we discuss complexity issues, and we
give a bit-vector implementation of the sorting algorithm that runs in 0{n^).



2 Basic Sorting 

The problem of sorting a signed permutation π by reversals is to find d(π), its
reversal distance from the identity permutation (+1 +2 ... +n). As usual,
we will frame a permutation π = (π_1 π_2 ... π_n) with 0 and n + 1, yielding the
permutation: (0 π_1 π_2 ... π_n n + 1).

Given a signed permutation π = (0 π_1 π_2 ... π_n n + 1), an oriented pair
(π_i, π_j) is a pair of adjacent integers, that is |π_i| - |π_j| = ±1, with opposite
signs. For example, the oriented pairs of the permutation:

( 0 +3 +1 +6 +5 -2 +4 +7 )

are (+1, -2) and (+3, -2).
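The definition of oriented pairs translates into a few lines of code. The sketch below assumes a framed permutation stored as a Python list of signed integers, with the frame element 0 treated as positive; the function name `oriented_pairs` is introduced here for illustration.

```python
def oriented_pairs(perm):
    """Return the oriented pairs (pi_i, pi_j), i < j, of a framed signed permutation."""
    position = {abs(v): idx for idx, v in enumerate(perm)}
    pairs = []
    for x in range(len(perm) - 1):
        # x and x + 1 are the two adjacent integers; order them by position
        i, j = sorted((position[x], position[x + 1]))
        if (perm[i] < 0) != (perm[j] < 0):  # opposite signs, 0 counts as positive
            pairs.append((perm[i], perm[j]))
    return pairs


print(oriented_pairs([0, 3, 1, 6, 5, -2, 4, 7]))
```

For the example permutation only the element 2 is negative, so the only candidates are its neighbours in value, 1 and 3.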

Oriented pairs are useful in the sense that they indicate reversals that create
consecutive elements. For example, the pair (+1, -2) induces the reversal:

( 0 +3 +1 +6 +5 -2 +4 +7 )

( 0 +3 +1 +2 -5 -6 +4 +7 )

creating the consecutive sequence +1 +2.

In general, the reversal induced by an oriented pair (π_i, π_j) will be

ρ(i, j - 1), if π_i + π_j = +1, and
ρ(i + 1, j), if π_i + π_j = -1.
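The rule above can be sketched as follows, assuming a framed permutation stored as a list in which list index and position coincide (the frame element 0 sits at position 0). The names `reversal` and `induced_reversal` are illustrative.

```python
def reversal(perm, i, j):
    """Signed reversal rho(i, j): reverse positions i..j and flip every sign."""
    return perm[:i] + [-v for v in reversed(perm[i:j + 1])] + perm[j + 1:]


def induced_reversal(perm, i, j):
    """Reversal induced by the oriented pair at positions i < j."""
    a, b = perm[i], perm[j]
    if abs(abs(a) - abs(b)) != 1 or (a < 0) == (b < 0):
        raise ValueError("positions i and j do not hold an oriented pair")
    if a + b == 1:
        return reversal(perm, i, j - 1)
    return reversal(perm, i + 1, j)  # here a + b == -1


pi = [0, 3, 1, 6, 5, -2, 4, 7]
print(induced_reversal(pi, 2, 5))  # pair (+1, -2)
print(induced_reversal(pi, 1, 5))  # pair (+3, -2)
```

In both cases the two members of the pair end up next to each other with equal signs, which is the consecutive sequence the text refers to.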

Note that reversals that create consecutive pairs of integers are always induced
by oriented pairs. Such a reversal is called an oriented reversal. We define the 
score of an (oriented) reversal as the number of oriented pairs in the resulting 
permutation. For example, the score of the reversal: 

( 0 +3 +1 +6 +5 -2 +4 +7 )

( 0 -5 -6 -1 -3 -2 +4 +7 )

is 4, since the resulting permutation has 4 oriented pairs. Computing the score 
of a reversal is tedious but elementary, and we will discuss efficient algorithms 
to do so in Section 4. The fact that oriented reversals have a beneficial effect on 
the ordering of a permutation suggests a first sorting strategy: 



Algorithm 1. As long as π has an oriented pair, choose the oriented reversal
that has maximal score.

For example, the two oriented pairs of the permutation: 

( 0 +3 +1 +6 +5 -2 +4 +7 ) 

are (+1, -2) and (+3, -2), and their scores are respectively 2 and 4. So we choose
the reversal induced by (+3, -2), yielding the new permutation:

( 0 -5 -6 -1 -3 -2 +4 +7 ).

This permutation now has four oriented pairs (0, -1), (-3, +4), (-5, +4) and
(-6, +7), all of which have score 2, except (-3, +4), which has score 4. Acting on this pair yields:

( 0 -5 -6 -1 +2 +3 +4 +7 ),

which has four oriented pairs. Note here that the score of the pair (0, -1) is 0. The
corresponding oriented reversal would produce a permutation with no oriented
pair, and the algorithm would stop, in this case with an unsorted permutation.
Fortunately, the pair (-1, +2) has a positive (and maximal) score, and we get,
in a similar way, the last two necessary reversals to sort the permutation:

( 0 -5 -6 +1 +2 +3 +4 +7 )

( 0 -5 -4 -3 -2 -1 +6 +7 )

( 0 +1 +2 +3 +4 +5 +6 +7 )
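The whole trace above can be reproduced with a direct, unoptimized transcription of Algorithm 1. This is a sketch under stated assumptions: permutations are framed lists, scores are recomputed from scratch, and ties between reversals of equal score are broken by taking the first candidate, a choice the text leaves open.

```python
def oriented_pairs(perm):
    """Positions (i, j), i < j, of all oriented pairs of a framed signed permutation."""
    position = {abs(v): idx for idx, v in enumerate(perm)}
    pairs = []
    for x in range(len(perm) - 1):
        i, j = sorted((position[x], position[x + 1]))
        if (perm[i] < 0) != (perm[j] < 0):  # 0 counts as positive
            pairs.append((i, j))
    return pairs


def reversal(perm, i, j):
    """Signed reversal of positions i..j."""
    return perm[:i] + [-v for v in reversed(perm[i:j + 1])] + perm[j + 1:]


def induced_reversal(perm, i, j):
    """Reversal induced by the oriented pair at positions i < j."""
    if perm[i] + perm[j] == 1:
        return reversal(perm, i, j - 1)
    return reversal(perm, i + 1, j)


def score(perm):
    """Score of a reversal = number of oriented pairs of the resulting permutation."""
    return len(oriented_pairs(perm))


def algorithm1(perm):
    """Apply oriented reversals of maximal score until no oriented pair remains."""
    history = [perm]
    while oriented_pairs(perm):
        candidates = [induced_reversal(perm, i, j) for i, j in oriented_pairs(perm)]
        perm = max(candidates, key=score)  # the first maximum wins on ties
        history.append(perm)
    return history


for step in algorithm1([0, 3, 1, 6, 5, -2, 4, 7]):
    print(step)
```

On the running example this sketch performs five reversals and visits the same permutations as the text. When the loop stops on an unsorted permutation, all elements are positive and the hurdle machinery of the next subsection is needed.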
Interestingly enough, this elementary strategy is sufficient to optimally sort
most random permutations and almost all permutations that arise from biologi- 
cal data. The strategy is also optimal, and we will prove in the next section the 
following claim. 



Claim 1: If the strategy of Algorithm 1 applies k reversals to a permutation π,
yielding a permutation π', then d(π) = d(π') + k.

The output of Algorithm 1 will be a permutation of positive elements. Most
reversals applied to such permutations will create oriented pairs, but the choice
of an optimal reversal is delicate. We discuss this problem in the next paragraph.

2.1 Sorting Positive Permutations 

Let π be a signed permutation with only positive elements, and assume that π
is reduced, that is π does not contain consecutive elements. Suppose also that π
is framed by 0 and n + 1 and consider, as in Q, the circular order induced by
setting 0 to be the successor of n + 1.

Define a framed interval in π as an interval of the form:

i π_{j+1} π_{j+2} ... π_{j+k-1} i + k

such that all integers between i and i + k belong to the interval [i . . .i + k]. For 
example, consider the permutation: 

( 0 2 5 4 3 6 1 7 ). 

The whole permutation is a framed interval by construction. But we have also 
the interval: 2 5 4 3 6, which can be reordered as 2 3 4 5 6, and, by circularity, the 
interval 6 1 7 0 2, which can be reordered as 6 7 0 1 2, since 0 is the successor of 7. 



Definition 1. If π is reduced, a hurdle in π is a framed interval that properly
contains no framed interval.
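Definition 1 can be explored by brute force. The sketch below assumes a framed, positive, reduced permutation stored as a list, treats positions and values circularly (0 follows n + 1), and only looks at intervals with k of at least 2 so that the frame junction n + 1, 0 is not counted. The names `framed_intervals` and `hurdles` are illustrative, and the cubic-time search is meant for small examples only.

```python
def framed_intervals(perm):
    """Return (start, k) for every framed interval: it starts at position start,
    spans k + 1 circular positions, and holds exactly the values i, ..., i + k."""
    size = len(perm)
    found = []
    for start in range(size):
        for k in range(2, size):
            end = (start + k) % size
            if perm[end] != (perm[start] + k) % size:
                continue
            values = {perm[(start + t) % size] for t in range(k + 1)}
            wanted = {(perm[start] + t) % size for t in range(k + 1)}
            if values == wanted:
                found.append((start, k))
    return found


def hurdles(perm):
    """Framed intervals that properly contain no other framed interval."""
    size = len(perm)
    intervals = framed_intervals(perm)
    cover = {iv: {(iv[0] + t) % size for t in range(iv[1] + 1)} for iv in intervals}
    return [iv for iv in intervals
            if not any(cover[other] < cover[iv] for other in intervals)]


print(framed_intervals([0, 2, 5, 4, 3, 6, 1, 7]))
print(hurdles([0, 2, 5, 4, 3, 6, 1, 7]))
```

For the example permutation the search finds the whole permutation, the interval 2 5 4 3 6 and the circular interval 6 1 7 0 2; only the last two are hurdles.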



Claim 2: Hurdles as defined in Definition 1 are the same hurdles that are de- 
fined in Q and Q. 

When a permutation has only one or two hurdles, one reversal is sufficient to
create enough oriented pairs in order to completely sort the permutation with
Algorithm 1. Two operations are introduced in Q. The first one is hurdle cutting,
which consists in reversing one internal element, say π_{j+1}, of a hurdle:

i -π_{j+1} π_{j+2} ... π_{j+k-1} i + k.

This reversal is sufficient to sort the whole interval using Algorithm 1. For example,
the following permutation contains only one hurdle:

( 0 2 4 3 1 5 ).

The reversal of element 2 cuts the hurdle, and the resulting permutation

( 0 -2 4 3 1 5 )

can be sorted with 4 reversals by Algorithm 1.

The second operation is hurdle merging, which acts on the end points of two
hurdles:

i ... i + k ... i' ... i' + k'

and does the reversal ρ(i + k, i'). If a permutation has only two hurdles, merging
them will produce a permutation that can be completely sorted by Algorithm 1.
Thus, for example, merging the two hurdles in the permutation

( 0 2 5 4 3 6 1 7 )

yields the permutation:

( 0 2 5 4 3 -6 1 7 ),

which can be sorted in 5 reversals using Algorithm 1.

Merging and cutting hurdles in a permutation that contains more than 2 
hurdles must be managed carefully. Indeed, cutting some hurdles can create new 
ones! 



Definition 2. A simple hurdle is a hurdle whose cutting decreases the number 
of hurdles. Hurdles that are not simple are called super hurdles. 

For example, the permutation ( 0 2 5 4 3 6 1 7 ) has two hurdles. Cutting the
hurdle 2 5 4 3 6 yields the permutation

( 0 2 3 4 5 6 1 7 ),

which, by collapsing the sequence 2 3 4 5 6, is reduced to:

( 0 2 1 3 ),

which has only one hurdle. However, the permutation ( 0 2 4 3 5 1 6 8 7 9 )
contains two hurdles, and if one cuts the hurdle 2 4 3 5, the resulting reduced
permutation will be

( 0 2 1 3 5 4 6 ),

which still has two hurdles.

The following algorithm is adapted from Q, and is discussed originally in Q.

Algorithm 2. If a permutation has 2k hurdles, k ≥ 2, merge any two non-consecutive
hurdles. If a permutation has 2k + 1 hurdles, k ≥ 1, then if it has one simple
hurdle, cut it; if it has none, merge two non-consecutive hurdles, or consecutive
ones if k = 1.

Together with Algorithm 1, Algorithm 2 can be used to optimally sort any 
signed permutation. This completes the first part of the paper, and, in the next 
section, we turn to the task of proving our various claims. 

3 Selected Results from the Hannenhalli-Pevzner Theory 

The exposition of the complete results of the Hannenhalli-Pevzner theory is 
beyond the scope of this paper, and the reader is referred to the original paper 
Q, or the book on computational molecular biology by Pevzner Q. Instead, we 
will show the soundness of our algorithms by directly using the overlap graph 
introduced in Q. 

The first step in the construction of the overlap graph is to simulate a signed
permutation on n elements with an unsigned permutation on 2n elements. Each
positive element x in the permutation is replaced by the sequence 2x - 1 2x,
and each negative element -x by the sequence 2x 2x - 1. For example, the
permutation:

π = ( 0 -1 +3 +5 +4 +6 -2 +7 )

becomes:

π' = ( 0 2 1 5 6 9 10 7 8 11 12 4 3 13 )

Reversals ρ(i, j) of π are simulated by unsigned reversals ρ(2i - 1, 2j) in π'.
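The simulation by an unsigned permutation is a one-pass transformation. The sketch below assumes a framed input list and, following the example above, maps the frame elements 0 and n + 1 to 0 and 2n + 1; the function name `to_unsigned` is illustrative.

```python
def to_unsigned(perm):
    """Map a framed signed permutation [0, pi_1, ..., pi_n, n + 1] to its unsigned image."""
    n = len(perm) - 2
    image = [0]
    for v in perm[1:-1]:
        x = abs(v)
        # +x becomes (2x - 1, 2x); -x becomes (2x, 2x - 1)
        image += [2 * x - 1, 2 * x] if v > 0 else [2 * x, 2 * x - 1]
    image.append(2 * n + 1)
    return image


print(to_unsigned([0, -1, 3, 5, 4, 6, -2, 7]))
```

With this layout the element at position i of π occupies positions 2i - 1 and 2i of π', which is why ρ(i, j) corresponds to ρ(2i - 1, 2j).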

The overlap graph associated with a permutation π has n + 1 vertices labeled by
(0, 1), (2, 3), ..., (2n, 2n + 1), with an edge between two vertices (a, b) and (c, d)
iff, in the unsigned permutation, the interval corresponding to the positions of
a and b overlaps, without proper containment, the interval corresponding to
the positions of c and d. For example, if one draws arcs joining the end points
of the pairs (0, 1), (2, 3), ..., (2n, 2n + 1) in the permutation π':




0 2 1 5 6 9 10 7 8 11 12 4 3 13 

0 -1 +3 +5 +4 +6 -2 +7 

The overlap graph can then be easily drawn by tracing an edge for each pair of
intersecting arcs in the above diagram.

There is a natural bijection between the vertices of the overlap graph and
pairs of adjacent integers (π_i, π_j) in the original permutation. Indeed, a pair of
adjacent integers will generate four consecutive integers in the unsigned
permutation: 2x - 1, 2x, 2x + 1, and 2x + 2. The vertex (2x, 2x + 1) is associated with
the pair (π_i, π_j). For example, the oriented pair (3, -2) in π corresponds to the
vertex (4, 5) in the overlap graph. Vertices corresponding to oriented pairs are
naturally called oriented vertices, and are denoted by solid dots in the overlap
graph. Moreover, we will refer to the reversal induced by a vertex meaning the
reversal induced by the oriented pair corresponding to the vertex. The following
facts, mostly from Q, pinpoint the important relations between a signed
permutation and its overlap graph.

Fact 1: A vertex has an odd degree iff it is oriented. 

Proof. Let 2x - 1, 2x, 2x + 1, and 2x + 2 be the four integers associated with
the oriented pair (π_i, π_j). Since π_i and π_j have different signs, the positions of
2x and 2x + 1 will not have the same parity in the unsigned permutation. Thus,
the interval between 2x and 2x + 1 has an odd length, implying that it overlaps
an odd number of other intervals. On the other hand, any interval that overlaps
an odd number of intervals must have an odd length. Therefore the positions of
its end points must have different parities, implying that the corresponding pair
of adjacent integers is oriented. ■
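Fact 1 is easy to test numerically. The following sketch builds the overlap graph from the unsigned permutation and compares the parity of each degree with the orientation of the corresponding pair. It assumes framed input lists and numbers the vertex (2t, 2t + 1) simply as t; all function names are illustrative.

```python
def to_unsigned(perm):
    """Unsigned image of a framed signed permutation."""
    n = len(perm) - 2
    image = [0]
    for v in perm[1:-1]:
        x = abs(v)
        image += [2 * x - 1, 2 * x] if v > 0 else [2 * x, 2 * x - 1]
    image.append(2 * n + 1)
    return image


def overlap_graph(perm):
    """Adjacency sets of the overlap graph; vertex t stands for the pair (2t, 2t + 1)."""
    unsigned = to_unsigned(perm)
    where = {v: idx for idx, v in enumerate(unsigned)}
    arcs = [sorted((where[2 * t], where[2 * t + 1])) for t in range(len(perm) - 1)]
    adjacency = [set() for _ in arcs]
    for a, (lo_a, hi_a) in enumerate(arcs):
        for b, (lo_b, hi_b) in enumerate(arcs):
            if lo_a < lo_b < hi_a < hi_b:  # arcs cross, neither contains the other
                adjacency[a].add(b)
                adjacency[b].add(a)
    return adjacency


def is_oriented(perm, t):
    """True if the integers t and t + 1 carry opposite signs (0 counts as positive)."""
    negative = {abs(v): v < 0 for v in perm}
    return negative[t] != negative[t + 1]


pi = [0, -1, 3, 5, 4, 6, -2, 7]
graph = overlap_graph(pi)
print([len(nbrs) for nbrs in graph])
print([is_oriented(pi, t) for t in range(len(graph))])
```

Vertex t = 2, that is (4, 5), corresponds to the pair of integers 2 and 3, in agreement with the example in the text.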

Fact 2: If one performs the reversal corresponding to an oriented vertex v, the 
effect on the overlap graph will be to complement the subgraph of v and its adja- 
cent vertices. 

Proof. The reversal corresponding to an oriented vertex v has the effect of col- 
lapsing the associated interval, thus v will become isolated. Let u and w be two 
intervals overlapping v, meaning that exactly one of their end points lies in the 
interval spanned by v. The reversal induced by v will reverse these two points. 
Here, a picture is worth a thousand words: 

■ 

Fact 3: If one performs the reversal corresponding to an oriented vertex v, each 
vertex adjacent to v will change its orientation. 

Proof. Since v is oriented, it has an odd number 2k + 1 of adjacent vertices.
Let w be a vertex adjacent to v, with j neighbors also adjacent to v. With the
reversal, w will lose j + 1 neighbors, and gain 2k - j new ones. Thus the degree
of w will change by 2k - 2j - 1, changing its orientation. ■



Fact 4: The score of the oriented reversal corresponding to an oriented vertex v 
is given by: 

T + U - O - 1

where T is the total number of oriented vertices in the graph, U is the number 
of unoriented vertices adjacent to v, and O is the number of oriented vertices 
adjacent to v. 

Proof. This follows trivially from the preceding facts. ■ 
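The formula of Fact 4 can be evaluated on the running example of Section 2 and compared with the scores quoted there. The sketch below rebuilds the overlap graph, uses Fact 1 (odd degree) to decide orientation, and numbers the vertex (2t, 2t + 1) as t; the function names are illustrative.

```python
def overlap_graph(perm):
    """Adjacency sets of the overlap graph of a framed signed permutation."""
    n = len(perm) - 2
    unsigned = [0]
    for v in perm[1:-1]:
        x = abs(v)
        unsigned += [2 * x - 1, 2 * x] if v > 0 else [2 * x, 2 * x - 1]
    unsigned.append(2 * n + 1)
    where = {v: idx for idx, v in enumerate(unsigned)}
    arcs = [sorted((where[2 * t], where[2 * t + 1])) for t in range(n + 1)]
    adjacency = [set() for _ in arcs]
    for a, (lo_a, hi_a) in enumerate(arcs):
        for b, (lo_b, hi_b) in enumerate(arcs):
            if lo_a < lo_b < hi_a < hi_b:
                adjacency[a].add(b)
                adjacency[b].add(a)
    return adjacency


def fact4_scores(perm):
    """Score T + U - O - 1 of every oriented vertex, keyed by vertex number."""
    adjacency = overlap_graph(perm)
    oriented = [len(nbrs) % 2 == 1 for nbrs in adjacency]  # Fact 1
    total = sum(oriented)                                   # T
    scores = {}
    for v, nbrs in enumerate(adjacency):
        if oriented[v]:
            o = sum(oriented[w] for w in nbrs)              # O
            u = len(nbrs) - o                               # U
            scores[v] = total + u - o - 1
    return scores


print(fact4_scores([0, 3, 1, 6, 5, -2, 4, 7]))
```

Vertices 1 and 2 stand for the pairs (+1, -2) and (+3, -2), whose scores were given as 2 and 4 in Section 2.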

We now state a basic result that is proven, in different ways, both in Q and 
Q. Define an oriented component of the overlap graph as a connected component 
that contains at least one oriented vertex. A safe reversal is a reversal that does 
not create new unoriented components, except for isolated vertices. 

Proposition 1 (Hannenhalli and Pevzner). Any sequence of oriented safe 
reversals is optimal. 

The difficulties in sorting oriented components lie in the detection of safe 
reversals. Hannenhalli and Pevzner deal with the problem by computing several 
statistics on cycles and breakpoints of various graphs. Kaplan et al. solve it by 
searching for particular cliques in the overlap graph. The next theorem argues 
that the elementary strategy of choosing the reversal with maximal score is 
optimal, thus proving Claim 1. 

Theorem 1. An oriented reversal of maximal score is safe. 

Proof. Suppose that vertex v has maximal score, and that the reversal induced 
by v creates a new unoriented component C containing more than one vertex. 
At least one of the vertices in C must have been adjacent to v, since the only 
edges affected by the reversal are those between vertices adjacent to v. So, let w 
be a vertex formerly adjacent to v and contained in C, and consider the scores 
of v and w: 

score(v) = T + U - O - 1
score(w) = T + U' - O' - 1

All unoriented vertices adjacent to v must be adjacent to w. Indeed, an
unoriented vertex adjacent to v and not to w will become oriented, and connected
to w, contrary to the assumption that C is unoriented. Thus, U' ≥ U.

All oriented vertices adjacent to w must be adjacent to v. If this was not the
case, an oriented vertex adjacent to w but not to v would remain oriented, again
contradicting the fact that C is unoriented. Thus, O' ≤ O.

Now, if both O' = O and U' = U, vertices v and w have the same set of
adjacent vertices, and complementing the subgraph of v and its adjacent vertices
will isolate both v and w. Therefore, we must have that score(w) > score(v),
which is a contradiction. ■

3.1 Hurdles 

In this section, we assume that π is a positive and reduced permutation. These
assumptions are equivalent to saying that the overlap graph has no oriented
components (all of which can be cleared by Algorithm 1) and no isolated vertices.

Consider again the circular order, this time on the interval [0..2n— 1], induced 
by setting 0 to be the successor of 2n — 1. The span of a set of vertices X in the 
overlap graph is the minimum interval that contains, in the circular order, all 
the intervals of vertices in X. For example, the three connected components of 
the following overlap graph have spans [4, 15] = [4 7 8 11 12 9 10 13 14 5 6 15], 
[8, 13] = [8 11 12 9 10 13], and [16, 3] = [16 1 2 17 0 3]. 





Hurdles are defined in Q as unoriented components which are minimal with 
respect to span inclusion. Moreover, in Q, it is shown that the span of a con- 
nected component is always of the form [2i, 2j — 1]. The following Lemmas and 
Theorem detail the relationships between connected components and framed 
intervals, substantiating the second claim of Section 1. 

Lemma 1. Framed intervals of the form [i, j] in a permutation on n elements
are in one-to-one correspondence with framed intervals of the form [2i, 2j - 1]
in the corresponding unsigned permutation on 2n elements.

Proof. The end points i and j of a framed interval [i, j] will be mapped,
respectively, to the pairs 2i - 1, 2i, and 2j - 1, 2j. All integers between i and j appear
in the interval [i, j], if and only if all the integers between 2i and 2j - 1 appear
in the interval [2i, 2j - 1].

Lemma 2. Any framed interval [2i, 2j — 1] is the span of a union of connected 
components. 

Proof. If [2i, 2j - 1] is a framed interval, it contains exactly the integers
between 2i and 2j - 1, thus the only arcs in this interval are: (2i, 2i + 1),
(2i + 2, 2i + 3), ..., (2j - 2, 2j - 1), and no other arc intersects this set. Therefore, the
corresponding set of vertices is not connected to any other vertex. ■

Lemma 3. The span [2i, 2j — 1] of a connected component is always a framed 
interval. 

Proof. If vertex (2i, 2i + 1) is connected to (2j - 2, 2j - 1), there must be a
sequence of intersecting arcs linking 2i to 2j - 1.

Any arc with only one end point between 2i and 2j - 1 would therefore
intersect one of the arcs in the sequence, so there are none. Thus, if integer 2k
is in the interval, then 2k + 1 is also in the interval, and if i < k < j, then 2k + 2
is also in the interval. ■

Theorem 2. If π is reduced, an unoriented component is minimal iff its span
is a framed interval that contains no other.

Proof. By Lemma 3, the span of a connected component is always a framed
interval. If the component is minimal with respect to span inclusion, by Lemma
2, its span cannot contain properly another framed interval.

On the other hand, a framed interval [2i, 2j - 1] that contains no other yields a
single connected component C whose vertices' endpoints are exactly the integers
between 2i and 2j - 1. Thus the vertices of C are consecutive on the circle, and
component C is minimal. ■

The main consequence of Theorem 2 is to give an elementary characterization 
of the concept of hurdles, that does not need the construction of the overlap 
graph. 



4 Settling Scores 

We now turn to the problem of computing the score of a reversal. In the following,
we assume that the overlap graph of a permutation is explicitly represented as
a bit matrix, and consider, for a vertex v, the bit vector v whose w-th coordinate
is 1 iff vertex v is adjacent to vertex w.

We also assume that the score and parity of each vertex are stored in the
vectors s and p, respectively. It is only necessary to keep track of the variable
part of the score, that is, for both oriented and unoriented vertices, the expression
U_v - O_v, where U_v is the number of unoriented vertices adjacent to v, and O_v
is the number of oriented vertices adjacent to v.

As an example, consider the permutation 

(0 +3 +1 +6 +5 -2 +4 +7)

its overlap graph, and its associated data structure. 






[Figure: the overlap graph of this permutation, with one vertex per arc (0,1), (2,3), (4,5), ..., and the associated bit matrix with its score and parity vectors.]



Clearly, initializing such a structure requires O(n^2) steps. Given it, finding
the reversal with maximal score is trivial. The interesting part is the effect of a
reversal on the structure.

Suppose that we choose to perform the reversal corresponding to the oriented
vertex v. Since v will become unoriented and isolated, each vertex adjacent to v
will automatically gain a point of score. Thus we first set:

s ← s + v

Next, if w is a vertex adjacent to v, the vector w is changed according to:

w ← w ⊕ v

where ⊕ is the bitwise exclusive-or operation. Indeed, vertices adjacent to w
after the reversal are either existing vertices that were not adjacent to v, or
vertices that were adjacent to v but not to w. The exceptions to this rule are v
and w themselves, and this problem can easily be solved by setting the diagonal
bits v_v and w_w to 1 before computing the direct sum.

Now, if w is unoriented, each of its former adjacent vertices will lose one
point of score, since w will become oriented, and each of its new adjacent vertices
will lose one point of score. Note that a vertex that stays connected to w
will lose a total of two points. We can thus write the effect of the change of
orientation of w on the vector s of scores as:

s ← s - w
w_w ← 1
w ← w ⊕ v
s ← s - w

where the subtractions are performed componentwise on the vector of scores. If
w is oriented, the losses are converted to gains, so the subtractions are converted
to additions.

Finally, the parity vector p is updated by the equation:

p ← p ⊕ v.

The algorithm thus requires O(n) vector operations for each reversal,
and these operations can be implemented very efficiently using bit operations
widely available in processors.
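
The update rules above can be sketched with Python integers used as bit vectors. This is only an illustrative reading of the text, under stated assumptions: `rows[x]` is the adjacency bit vector of vertex x, `s` is a list of scores U - O, `p` is a bit mask whose bit x is 1 iff x is oriented, the diagonal bit v_v is set only after the first score update, and the row of v is cleared at the end because v becomes isolated. The function and variable names are hypothetical.

```python
def bits(mask):
    """Yield the indices of the set bits of an integer bit vector."""
    i = 0
    while mask:
        if mask & 1:
            yield i
        mask >>= 1
        i += 1


def add_mask(s, mask, sign):
    """Add `sign` to the score of every vertex whose bit is set in `mask`."""
    for u in bits(mask):
        s[u] += sign


def apply_reversal(rows, s, p, v):
    """Apply the reversal of oriented vertex v in place; return the new parity mask."""
    neighbours = list(bits(rows[v]))
    add_mask(s, rows[v], +1)          # s <- s + v
    rows[v] |= 1 << v                 # set the diagonal bit v_v (assumed order)
    for w in neighbours:
        # Unoriented w: neighbours lose points; oriented w: they gain points.
        sign = +1 if (p >> w) & 1 else -1
        add_mask(s, rows[w], sign)    # s <- s -/+ w  (former neighbours)
        rows[w] |= 1 << w             # w_w <- 1
        rows[w] ^= rows[v]            # w <- w XOR v
        add_mask(s, rows[w], sign)    # s <- s -/+ w  (new neighbours)
    p ^= rows[v]                      # p <- p XOR v (flips v and its neighbours)
    rows[v] = 0                       # assumption: v is now isolated
    return p


# Small example: edges 0-1, 0-2, 1-2, 1-3; vertices 0 and 2 are oriented.
rows = [0b0110, 0b1101, 0b0011, 0b0010]
s = [0, -1, 0, 1]
p = apply_reversal(rows, s, 0b0101, 0)
```

In this example the scores obtained incrementally agree with those recomputed from the updated graph, in which only the edge 1-3 remains and only vertex 1 is oriented.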

Kaplan et al. give an algorithm, building on earlier work, that clears the hurdles
from a permutation in less than O(n^2), and that can be used in conjunction with
the above algorithm. But since we already have an extensive representation of
the overlap graph, we can use it to keep track of the connected components, and
to detect hurdles.
Abstract. We present a solution for the following problem. Given two
sequences X = x_1 x_2 ... x_n and Y = y_1 y_2 ... y_m, n < m, find the best
scoring alignment of X' = X^k[i] vs Y over all possible pairs (k, i), for
k = 1, 2, ... and 1 ≤ i ≤ n, where X[i] is the cyclic permutation of X,
X^k[i] is the concatenation of k complete copies of X[i] (k tandem copies),
and the alignment must include all of Y and all of X'. Our algorithm
allows any alignment scoring scheme with additive gap costs and runs
in time O(nm log n). We have used it to identify related tandem repeats
in the C. elegans genome as part of the development of a multi-genome
database of tandem repeats.



1 Introduction 

1.1 Problem Description 

The problem we solve is the following: 

Tandem Cyclic Alignment 

Given: Two sequences X = x_1 x_2 ... x_n and Y = y_1 y_2 ... y_m, n < m, and an
alignment scoring scheme with additive gap costs.

Find: The best scoring alignment of X' = X^k[i] vs Y over all possible pairs
(k, i), for k = 1, 2, . . . and 1 ≤ i ≤ n, where X[i] is the cyclic permutation of X,

X[i] = x_i x_{i+1} ... x_n x_1 ... x_{i-1},

X^k[i] is the concatenation of k complete copies of X[i] (k tandem copies),
and the alignment must include all of Y and all of X'.
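
To make the problem statement concrete, here is a minimal brute-force sketch that simply tries every rotation i and every copy count k under unit-cost edit distance (match = 0, mismatch = indel = 1). It is not the algorithm of this paper; the function names and the bound on k are assumptions introduced for illustration only.

```python
def edit_distance(a, b):
    """Global alignment cost with match = 0, mismatch = indel = 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # replace ca by cb
        prev = cur
    return prev[-1]


def rotation(x, i):
    """X[i]: the cyclic permutation of x starting at 1-based position i."""
    return x[i - 1:] + x[:i - 1]


def tandem_cyclic_brute_force(x, y):
    """Return (score, k, i) minimising edit_distance(X^k[i], y)."""
    n, m = len(x), len(y)
    # With unit costs, k copies cost at least k*n - m, while a single copy
    # costs at most max(n, m), so larger values of k cannot be optimal.
    k_max = (m + max(n, m)) // n + 1
    best = None
    for i in range(1, n + 1):
        xi = rotation(x, i)
        for k in range(1, k_max + 1):
            cand = (edit_distance(xi * k, y), k, i)
            if best is None or cand < best:
                best = cand
    return best
```

For example, `tandem_cyclic_brute_force("ab", "abab")` returns `(0, 2, 1)`: two copies of the unrotated pattern align with the text at no cost.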

Let X and Y be two strings over an alphabet Σ. An alignment of X and Y
(see section 3.1 for an example) is a pair of equal length sequences X̄, Ȳ over the
alphabet Σ ∪ {-}, where - is a gap character and X, Y are obtained from X̄, Ȳ
by removing the gap characters. An alignment can be interpreted as a sequence
Q of edit operations that transform X into Y. The allowed operations are 1)
insert a symbol into X, 2) delete a symbol in X and 3) replace a symbol in X
with a (possibly identical) symbol from Σ. A scoring scheme defines a weight
for each possible operation and the alignment score is the sum of the weights
assigned to the operations in Q.

* Partially supported by NSF grants CCR-9623532 and CCR-0073081.

There are two widely used classes of scoring schemes, 1) distance scoring, in 
which identical replacement has weight = 0, all other operations have weight > 0 
and the best alignment has minimum score, and 2) similarity scoring, in which 
“good” replacements have weight > 0, all other operations have weight < 0 and 
the best alignment has maximum score. Within these classes, scoring schemes 
are further characterized by the treatment of gap costs. A gap is the result of the 
deletion of one or more consecutive characters in one of the sequences (insertion 
into the other sequence). Additive gap costs assign a constant weight to each 
of the consecutive characters. Other gap functions have been found useful for 
biological sequences, including affine gap costs (α + βk for a gap of k consecutive
characters, where α and β are constants) and concave gap costs (α + βf(k), where
f() is a concave function such as square root). The solution in this paper assumes
a scoring scheme with additive gap costs. For ease of discussion, we will, for the 
remainder of the paper, assume distance scoring although the results apply as 
well to similarity scoring. 
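
The three gap cost families can be written down directly. The following sketch uses illustrative constants (the default weights are assumptions, not values from the paper) and shows the property that distinguishes additive costs: splitting a gap into two pieces does not change its total cost.

```python
import math


def additive_gap(k, w=1.0):
    """Additive cost: a constant weight w for each of the k gap characters."""
    return w * k


def affine_gap(k, alpha=2.0, beta=1.0):
    """Affine cost: alpha + beta * k for a gap of k consecutive characters."""
    return alpha + beta * k


def concave_gap(k, alpha=2.0, beta=1.0, f=math.sqrt):
    """Concave cost: alpha + beta * f(k), with f concave (square root here)."""
    return alpha + beta * f(k)


# Only the additive cost is unchanged when a gap of 5 is split into 2 + 3.
split_additive = additive_gap(2) + additive_gap(3) == additive_gap(5)
split_affine = affine_gap(2) + affine_gap(3) == affine_gap(5)
```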

Our motivation for this problem arises from an ongoing effort to construct a
multi-genome database of tandem repeats (TRDB). A central task is the clustering
of tandem repeats into families, i.e., repeats that occur in different locations
in a genome but have identical or very similar underlying patterns. Grouping
these repeats will facilitate identification and study of their common properties.
Tandem repeat families have been detected in both prokaryotes and eukaryotes, 
including the E. coli, S. cerevisiae, C. elegans and human genomes. 

Clustering requires an effective and consistent means of measuring the simi- 
larity or distance between repeats. Standard comparison methods are not easily 
applied to tandem repeats because they contain repetitive, approximate copies of 
an underlying pattern. In addition, comparison of related repeats often reveals 
a scrambling of the left to right order of the slightly different internal copies. 
An accurate comparison method should be insensitive to copy number and copy 
order and we have therefore chosen to abstract the repeats as either 1) consensus 
patterns or 2) profiles and then compare them using alignment. 

Because repeat copies are adjacent, the designation of first position in a
consensus or profile is arbitrary. This is not just a theoretical abstraction: the
number of copies in a repeat is often not a whole number, and distinct repeats
which are obviously similar often do not start and end at the same relative
positions. Therefore, comparison must allow cyclic permutation of one pattern
so that its first position can be arbitrarily aligned with any position in the other.

Once families are constructed, we can determine interfamily evolutionary 
relationships by comparing patterns from different families. In particular, we can 
determine if one pattern consists of multiple approximate copies of the other, 
again with the property of cyclic permutation. It is this comparison that Tandem 
Cyclic Alignment addresses. 
1.2 Background

The Tandem Cyclic Alignment problem is a merger of two classes of pairwise 
alignment problems, 1) tandem alignment, in which one of the sequences con- 
sists of an indeterminate number of tandem copies of a pattern and 2) cyclic 
alignment, in which cyclic permutation of one of the patterns is allowed. Three 
related problems from these classes are: 



Pattern Local, Text Global Tandem Alignment. Given a pattern X, a
text Y and a scoring scheme for alignment, find the best scoring alignment of
X' = X^k[1] vs Y over all k = 1, 2, ..., where all of Y must occur in the alignment,
but where the part of X' aligned with Y need not contain a whole number of
copies of X[1]. The alignment, rather, may start and end on any index of X.



Pattern and Text Global Tandem Alignment. Given a pattern X, a text
Y, an index i, 1 ≤ i ≤ |X|, and a scoring scheme for alignment, find the best
scoring alignment of X' = X^k[i] vs Y, over all k = 1, 2, ..., where all of Y and
all of X' must occur in the alignment.



Cyclic Global Alignment. Given sequences X and Y and a scoring scheme
for alignment, find the best scoring alignment of X[i] vs Y over all possible i,
1 ≤ i ≤ |X|, where all of Y and exactly one whole (cyclically permuted) copy of
X must occur in the alignment.

The tandem alignment problems are both solved by wraparound dynamic
programming (WDP) in O(mn) time when the scoring function has additive
or affine gap costs. The cyclic alignment problem can be solved naively in
O(n^2 m) time by separately computing the alignment of X[i] vs Y for every value
of i. Maes presented an O(nm log n) time solution for scoring schemes with
additive gap costs by observing that there exists a set of best scoring alignments,
one for each 1 ≤ i ≤ n, such that the alignments are pairwise non-crossing (see
below). Landau, Myers and Schmidt gave an O(n + km) algorithm for unit cost
differences (edit distance) when the score of the best alignment is bounded by k.
Their algorithm, although theoretically efficient, has a large constant factor and
is difficult to implement because it requires constructing a suffix tree preprocessed
for least common ancestor queries. Schmidt gave a rather complicated
O(nm) algorithm for similarity scoring where each insertion/deletion character
costs -s and match/mismatch weights are in the interval [-s, m] for fixed positive
integer values m and s. This method cannot be used to compute general
distance scores more efficiently than the Maes algorithm.

It seems natural to adapt the Maes solution to our problem, except for one
difficulty: in tandem cyclic alignment, there may be no set of best scoring alignments
which are all pairwise non-crossing. What this means is that the number
of copies of X used in an alignment can vary depending on the starting position
i. (For an example see Section 3.1.) We show, though, that no alignment can
cross the “same” alignment more than once. This leads to an O(mn log n) time
solution using adaptations of the Maes algorithm and WDP.

Fig. 1. (left) Alignments do not cross; (right) the Maes algorithm.

The remainder of the paper is organized as follows. In Section 2 we give brief
descriptions of the non-crossing alignments property, the Maes algorithm and
wraparound dynamic programming. In Section 3 we give the main theorem about
crossing tandem cyclic alignments. In Section 4 we then apply this property to
obtain our algorithm. Finally, in Section 5 we show an example from our analysis
applied to tandem repeats from the C. elegans genome.

2 Preliminaries 

2.1 Non-crossing Alignments 

When gap costs are additive, a simple non-crossing property of optimal paths in
the two dimensional alignment matrix applies. We present one variation
appropriate for this paper.



Definition. Two paths A and B in an alignment matrix cross if there exist two
rows e and f such that in row e all matrix cells in path A are left of all cells in
path B and in row f all cells in path B are left of all cells in path A. The paths
share one or more common cells where they cross.

Note that sharing cells is not the same as crossing. 
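
The definition can be checked mechanically. In the sketch below a path is represented as a list of (row, column) cells, which is an assumption made for illustration; the helper names are hypothetical.

```python
def row_spans(path):
    """Map each row to the (leftmost, rightmost) column the path visits in it."""
    spans = {}
    for row, col in path:
        lo, hi = spans.get(row, (col, col))
        spans[row] = (min(lo, col), max(hi, col))
    return spans


def paths_cross(a, b):
    """True iff A is strictly left of B in some row and B strictly left of A in another."""
    sa, sb = row_spans(a), row_spans(b)
    common = sa.keys() & sb.keys()
    a_left = any(sa[r][1] < sb[r][0] for r in common)
    b_left = any(sb[r][1] < sa[r][0] for r in common)
    return a_left and b_left


A = [(0, 0), (1, 1), (2, 2)]
B = [(0, 2), (1, 1), (2, 0)]   # starts right of A, ends left of A: crosses
C = [(0, 1), (1, 1), (2, 3)]   # shares cell (1, 1) with A but never passes it
```

Here `paths_cross(A, B)` is true, while `paths_cross(A, C)` is false even though A and C share a cell.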



Property: Given an alignment matrix (see Figure 1, left) and four cells q, r, s
and t with q left of r in the top row and s left of t in the bottom row, for any
optimal scoring path A from q to s, there exists an optimal scoring path B from
r to t such that the two paths do not cross.

Proof. By contradiction, suppose all optimal scoring paths from r to t cross A.
Let B be one such path. A and B must cross an even number of times. Consider
the separate subpaths (labeled A_i and B_i in the figure) in which A is first right
of B proceeding from the top row.

Claim 1: Cost of B_2 is equal to or worse than cost of A_2. Otherwise A is not
optimal because joining subpaths A_1, B_2 and A_3 is better.

Claim 2: Cost of B_2 is better than cost of A_2. Otherwise, joining subpaths B_1,
A_2 and B_3 gives a path with score no worse than B, but which does not cross
A. Such a path was assumed not to exist.

Clearly Claims 1 and 2 lead to a contradiction.

2.2 The Maes Algorithm for Cyclic Global Alignment 

The Maes algorithm capitalizes on the non-crossing property to bound the
area of the alignment matrix that must be computed for each index i in the
alignment of X[i] vs Y.

First the alignment of X[1] vs Y (call it A_1) is computed in O(nm) time. A
new matrix is then constructed which uses two concatenated copies of X vs Y
(Figure 1, right). The alignment A_1 shifted right (call it A_{n+1}) optimally aligns
the second copy of X with Y.

A_1 and A_{n+1} bound any alignments which start and end between them.
Specifically, they bound the alignment of X[n/2] vs Y (call it A_{n/2}). It is easy
to see that this procedure can be followed recursively, for a logarithmic number
of steps, subdividing X into halves, then fourths, etc., always at the midpoints
between bounding alignments. In each step, the alignment score calculations in
a matrix cell are computed once, except for matrix cells on a bounding path,
where they are computed twice (once for the computation in the interval to
the left and once to the right), yielding O(nm log n) as the overall time of the
algorithm.

2.3 Wraparound Dynamic Programming (WDP) 

WDP models the similarity computation of Y with an unrestricted number
of copies of X while using an alignment matrix of size nm rather than of size
m^2, i.e. using only one copy of X. WDP computes in matrix cell S[i, j] the optimal
score that would be obtained by aligning y_1 ... y_i with X* x_1 ... x_j, where X*
indicates zero or more tandem copies of X. The correctness proof hinges on the
observation that any optimal scoring alignment will not contain a single deletion
of h ≥ n characters of X (n = |X|). This is so because otherwise, another
alignment exists, identical except for having a deletion of only h - n characters,
and possessing a better score. Since WDP examines all alignments with deletions
in X of size < n, it produces the optimal scoring alignment.

The technique involves computing two passes through each row. In both
passes, all cells but the first are treated normally. In the first pass, cell S[i, 1]
(corresponding to y_i and x_1) is given the better of 1) a value derived from the
cell S[i - 1, 1], the first cell in the row above (corresponding to a deletion of
y_i) and 2) a value derived from cell S[i - 1, n], the last cell in the row above
(corresponding to a pairing of y_i and x_1). This latter is a wraparound value. In
the second pass, S[i, 1] receives the better of 1) its current value, and 2) a
value derived from S[i, n], the last cell in its row (corresponding to a deletion of
x_1). This is also a wraparound value.
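
The two-pass row computation can be sketched as follows for the pattern and text global case with unit costs (match = 0, mismatch = indel = 1). This is a minimal illustration under assumptions, not the authors' implementation: column 0 is an added boundary column for "no character of X consumed yet", and `edit_distance` is included only so that the result can be compared with explicitly concatenated copies.

```python
def edit_distance(a, b):
    """Plain global alignment cost with match = 0, mismatch = indel = 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def wdp_score(x, y, indel=1, mismatch=1):
    """Best cost of aligning all of y with k >= 1 whole copies of x."""
    n, m = len(x), len(y)
    prev = [j * indel for j in range(n + 1)]      # row 0
    for i in range(1, m + 1):
        cur = [i * indel] + [0] * n
        for second_pass in (False, True):
            for j in range(1, n + 1):
                diag, left = prev[j - 1], cur[j - 1]
                if j == 1:
                    # Wraparound: column n of the row above precedes column 1.
                    diag = min(prev[0], prev[n])
                    if second_pass:
                        # Wraparound within the row, using pass-one cur[n].
                        left = min(cur[0], cur[n])
                sub = 0 if x[j - 1] == y[i - 1] else mismatch
                cur[j] = min(prev[j] + indel, diag + sub, left + indel)
        prev = cur
    return prev[n]


def tandem_by_concatenation(x, y):
    """Reference value: try explicit concatenations of 1, 2, ... copies of x."""
    k_max = 2 * len(y) // len(x) + 2
    return min(edit_distance(x * k, y) for k in range(1, k_max + 1))
```

On small inputs such as `wdp_score("abc", "abcab")` the wraparound computation gives the same value as trying each copy count explicitly, while using a matrix with only one copy of the pattern.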

3 Crossing Tandem Cyclic Alignments 

Here we show that although tandem cyclic alignments may cross, no alignment
can cross the “same” alignment more than once. “Same” in this case means an
alignment that has been shifted one or more full copies of the pattern left or
right, similar to the shifting of the alignment A_1 to become A_{n+1} in the Maes
algorithm.

3.1 An Example 

Let the pattern X and text Y be

X = gaccga    Y = accgatacgagacccgagaacgagaccg.

Then, using an edit distance scoring scheme (match = 0, mismatch = indel = 1),
the only best scoring alignment of X^k[1] vs Y (with a score of 6) uses 5 copies
of X[1]:



* * 
gaccga gaccga ga-ccga gaccga gaccga 
-accga ta-cga gacccga gaacga gaccg- 

while the only best scoring alignment of X^k[4] vs Y (with a score of 8) uses 4
copies of X[4]:

* *
--cgagac cgaga-c cgagac cgaga-c-
accgata- cgagacc cgagaa cgagaccg

Since the alignments use a different number of copies of X, they cross and
there is no set of best scoring pairwise non-crossing alignments.
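
The scores quoted in this example can be recomputed from the two alignments as printed. The sketch below assumes the gapped rows are written with `-` as the gap character and the spaces between copies removed; `alignment_cost` is a hypothetical helper, not part of the paper.

```python
def alignment_cost(top, bottom, mismatch=1, indel=1):
    """Cost of a gapped alignment under match = 0, mismatch = indel = 1."""
    assert len(top) == len(bottom)
    cost = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            cost += indel        # a gap in either row is one indel
        elif a != b:
            cost += mismatch
    return cost


X = "gaccga"
Y = "accgatacgagacccgagaacgagaccg"

# Alignment of five copies of X[1] = gaccga with Y.
top1 = "gaccga" + "gaccga" + "ga-ccga" + "gaccga" + "gaccga"
bot1 = "-accga" + "ta-cga" + "gacccga" + "gaacga" + "gaccg-"

# Alignment of four copies of X[4] = cgagac with Y.
top4 = "--cgagac" + "cgaga-c" + "cgagac" + "cgaga-c-"
bot4 = "accgata-" + "cgagacc" + "cgagaa" + "cgagaccg"

score1 = alignment_cost(top1, bot1)
score4 = alignment_cost(top4, bot4)
```

Removing the gap characters from the top rows gives five copies of gaccga and four copies of cgagac respectively, and both bottom rows spell out Y, so the two alignments cover all of the pattern copies and all of the text.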

3.2 No Alignment Crosses the “Same” Alignment more than Once 

Theorem 1. Given two sequences, X and Y, and an index i, 1 ≤ i ≤ n, let c_i be
the number of copies of X[i] in a best scoring alignment of X' = X^k[i] vs Y over
all k = 1, 2, ..., where all of Y and all of X' must be included in the alignment
(i.e. c_i = k in that best scoring alignment). Then, for any j, 1 ≤ j ≤ n, there
exists a best scoring alignment of X' = X^h[j] vs Y over all h = 1, 2, ... such
that c_j = h in that alignment and |c_i - c_j| ≤ 1.

In other words, if a best scoring alignment of X^k[i] vs Y uses c copies of X[i],
then for any j, there is a best scoring alignment of X^h[j] vs Y which uses one
of {c - 1, c, c + 1} copies of X[j].

Fig. 2. An illustration of Theorem 1.



Proof. Assume that |c_i - c_j| > 1. We show a contradiction if c_j > c_i + 1. A
similar argument holds for c_j < c_i - 1. Refer to Figure 2. Let A_1 be a best scoring
alignment of X^{c_i}[i] vs Y and let B be a best scoring global alignment of X^{c_j}[j]
vs Y with smallest c_j, and let c_j > c_i + 1. Let A_2 be a duplication of A_1 shifted
to the right by one copy of X[i] and let A_3 be the rightmost shifted copy crossed
by B. (By assumption, A_2 and A_3 are distinct.)

Let r and s' be the points, respectively, where B crosses A_2 and A_3 and let s
correspond to the point on A_2 matching s'. Call x ≻ y the part of an alignment
from point x to point y. Let s ≻ t be a duplication of s' ≻ w in B shifted to the
left. Finally call cost(x ≻ y) the alignment score for x ≻ y (and recall that we
are assuming distance scoring so that smaller cost is better).

Claim 1: cost(r ≻ s') ≥ cost(r ≻ s). Otherwise, piece together q ≻ r, r ≻ s' and
s' ≻ v to get a better scoring alignment than A_2. But A_2 is optimal.

Claim 2: cost(r ≻ s') < cost(r ≻ s). Otherwise, piece together p ≻ r, r ≻ s and
s ≻ t to get an alignment with score no worse than B, and using fewer than c_j
copies of X[j]. But B uses minimal copies.

Claims 1 and 2 produce a contradiction.

4 The Tandem Cyclic Alignment Algorithm 

The Tandem Cyclic Alignment problem is solved in three steps. Each step requires
first finding a guide alignment and then implementing the Maes algorithm
using the guide as alignment A_1. Since we are using tandem copies of the pattern,
the Maes algorithm will be implemented as Bounded Wraparound Dynamic
Programming (BWDP), which is described following the outline of the main
algorithm:

Step 1: Use pattern and text global WDP (section 1.2) to find the best scoring
alignment of X^k[1] vs Y for k = 1, 2, .... Call this alignment A. Let the number
of copies of X used in A be c. Use A as A_1 in the BWDP version of the
Maes algorithm to find the remaining best scoring non-crossing alignments
for X^c[i] vs Y for i = 2, . . . , n. Call the best scoring alignment from this step
B_c. Save B_c.

Call c* the number of copies in the solution to the tandem cyclic alignment
problem. At this point, we have saved the best from the set of cyclic alignments
each of which uses c copies of X, but we do not know if c = c*. However, by
Theorem 1, we know that c* ∈ {c - 1, c, c + 1}.

Step 2: Using A (from step 1) and a copy of A shifted to the right one pattern
length, find the best scoring alignment of X^{c+1}[1] vs Y using BWDP. Call
this alignment A+ (Figure 3). Use A+ as A_1 in the BWDP version of the
Maes algorithm to find the remaining best scoring, non-crossing alignments
for X^{c+1}[i] vs Y for i = 2, . . . , n. Call the best scoring alignment from this
step B_{c+1}. Save B_{c+1}.

Step 3: Using A (again from step 1) and a copy of A shifted to the left,
find the best scoring alignment of X^{c-1}[1] vs Y using BWDP. Call this
alignment A-. Use A- as A_1 in the BWDP version of the Maes algorithm
to find the remaining best scoring, non-crossing alignments for X^{c-1}[i] vs Y
for i = 2, . . . , n. Call the best scoring alignment from this step B_{c-1}. Save
B_{c-1}.

Step 4: Choose the best scoring alignment from B_c, B_{c+1} and B_{c-1}.
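
The way the steps above use Theorem 1 can be illustrated with a deliberately naive sketch: find c for the unrotated pattern, then examine only c - 1, c and c + 1 copies for every rotation. Plain unit-cost edit distance on explicitly concatenated copies stands in for WDP, BWDP and the Maes algorithm here, so this sketch does not achieve the running time of the algorithm; all names are hypothetical.

```python
def edit_distance(a, b):
    """Global alignment cost with match = 0, mismatch = indel = 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def best_copies(x, y, k_values):
    """Return (score, k) for the best number of copies among k_values."""
    return min((edit_distance(x * k, y), k) for k in k_values)


def tandem_cyclic_three_candidates(x, y):
    """Return (score, k, i), examining only c - 1, c, c + 1 copies per rotation."""
    n, m = len(x), len(y)
    k_max = (m + max(n, m)) // n + 1          # no larger k can be optimal
    _, c = best_copies(x, y, range(1, k_max + 1))   # guide alignment, i = 1
    candidates = [k for k in (c - 1, c, c + 1) if k >= 1]
    best = None
    for i in range(1, n + 1):
        xi = x[i - 1:] + x[:i - 1]            # X[i]
        score, k = best_copies(xi, y, candidates)
        if best is None or (score, k, i) < best:
            best = (score, k, i)
    return best
```

For instance, `tandem_cyclic_three_candidates("ab", "abab")` returns `(0, 2, 1)`.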



Time Complexity. Each of the three main steps starts with finding a guide
alignment using WDP or BWDP in time O(nm). Then each step finds the remaining
alignments using the BWDP version of the Maes algorithm in
O(nm log n) time. The total time is therefore O(nm log n).



4.1 Bounded Wraparound Dynamic Programming (BWDP) 

BWDP is computed in an alignment matrix W[i, j] of size (m + 1)(2n + 1), i.e. it
uses two copies of X. We are given two alignments L and R as boundaries. We
assume that L and R are both alignments of X^c[j] vs Y for a fixed c and different
j and that neither crosses outside the pair of “master” bounding alignments
X^c[1] vs Y and its duplicate shifted right one copy of X (or alternately X^c[1] vs
Y and its duplicate shifted left).

We use L and R to obtain, for each row i = 0, . . . , m in the matrix, the leftmost,
L[i], and the rightmost, R[i], boundary columns between which alignment
scores will be computed. Finally, we are given an index k, L[0] ≤ k ≤ R[0], as
the starting column for the alignment. Figure 4, left side, shows the bounded
computation as it would appear if we use an unrestricted number of copies of
X. Note that for some i, L[i] may be left of the starting position k or R[i] may
be right of the ending position k + cn. In this case, we contract the boundaries
to the appropriate values.
Fig. 3. Finding the guide alignments: A+ (left) and A- (right). BWDP achieves
the same result using only one copy of X.

Fig. 4. Bounded wraparound dynamic programming simulates computation with
an unrestricted number of copies of X.



Figure 4, right side, shows the same bounded computation, but this time, in
an array which contains only two copies of X. When the boundaries exceed the
first two copies of X, the computations wrap around. There is only one question
which must be addressed to guarantee that the BWDP result is the same as the
unrestricted-copies-of-X result. Can the boundaries collide or cross in the space
of only two copies of X (i.e. can R catch up with L as they wrap around)?



Definition: The width of the computation space is the maximum difference
R[i] - L[i] + 1, i = 0, . . . , m in the unrestricted-copies-of-X computation.

Lemma 2. The maximum width of the computation space is 2n. 

Proof. Note that all the boundaries must lie within the “master” boundaries
so it suffices to show the maximum width for the masters. Since the master
alignments are duplicates separated by one copy of X, corresponding positions
in the alignments are n columns apart, i.e. they occur in columns c and c + n
(Figure 5). Consider a row i in which the alignment moves horizontally in the
matrix (a deletion of characters in X). If L[i] is in column c, then R[i] is in
column c + n + h where h is the length of the horizontal move. As stated previously
(section 2.3), h ≤ n - 1, so the width (c + n + h) - c + 1 of the computation is at
most 2n.

Corollary 3. The bounding alignments cannot collide in the BWDP array
which has 2n columns (excluding column zero which is not used after the boundaries
wrap around).

5 Application to Tandem Repeats from the C. elegans Genome

We implemented the tandem cyclic alignment algorithm and used it to analyze 
the consensus patterns of tandem repeats found in the C. elegans genome. Our 
goal was to identify pairs of patterns, one of which is a multiple approximate copy 
of the other. The individual repeats were obtained with the Tandem Repeats 
Finder (TRF) program, which identifies approximately 25,000 tandem repeats
in C. elegans. From these, we selected nearly 5300 repeats in four groups with 
nominal pattern sizes of 70 base pairs (bp), 51bp, 35bp, and 17bp. (Repeats 
within a group had pattern sizes within 3bp of the nominal size.) Each repeat 
was paired with every repeat from all groups of smaller nominal pattern size 
(except 70 bp which was not paired with 51 bp and 51 bp which was not paired 
with 35 bp) and tandem cyclic alignment was run on all pairs. 

Fig. 5. The maximum width of the computation space is 2n.

DNA consists of two strands, one of which is the reverse complement of the
other. In a reverse complement, the direction of the sequence is reversed, the
As and Ts are swapped and the Cs and Gs are swapped. Since similar repeats
may have been found as reverse complements, for every pair, we first align the
patterns as they appear and then reverse complement one pattern and align
them again.
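
The reverse complement operation described above is short enough to sketch directly; the function name is hypothetical and lower-case input is handled as an added convenience.

```python
# Swap A with T and C with G, in upper and lower case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse the sequence and swap A with T and C with G."""
    return seq.translate(_COMPLEMENT)[::-1]


example = reverse_complement("GCAAT")
```

Applying the function twice returns the original sequence, which is why each pair only needs to be aligned in two orientations.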

The total number of alignments (including reverse complements) was 16.3
million. On a 500 MHz PC, the alignments took 17 and 3/4 hours, or just over
an hour per million alignments.

Figure 6 illustrates an example of the relationships found in this search. It
consists of the alignment of the consensus patterns from 3 repeats, from widely
scattered genomic locations, that were found to be related. Pattern 1176 is 19
bp long. Four copies are shown (indicated by alternate shading). Pattern 197
is 34 bp long. Two copies are shown. Pattern 8989 is 68 bp long. One copy is
shown. For the latter two patterns, only the differences with the top pattern are
indicated. A dash (-) means that there is no character that corresponds to the
character in the top line. This is an insertion (into the top line) or a deletion
(from the second or third line).

Notice first that pattern 8989 is almost identical to two copies of pattern 197, 
differing only in the substitution of A and C. TRF is able to find such closely 
related patterns (of different sizes) for the same repeat and in fact reported that 
repeat 8989 also had a pattern of size 34 that was identical to 197. 

Next compare patterns 1176 and 197. Pattern 197 consists of two copies of
1176 with 6 differences. Because the two halves of 197 were quite different, TRF
did not report a pattern of size 19 (or any other similar size) for repeat 197. The
following scenario (highly speculative!) may have occurred. Two identical 19 bp
copies existed in an ancestral repeat and one of those copies was extensively mutated,
including the deletion of 4 nucleotides. The resulting pair of repeats, now
34 bp long, was subsequently transposed to another location in the genome where
it duplicated, forming a tandem repeat. Some evidence for this scenario exists
in one of the copies of repeat 1176 which contains the adjacent two nucleotide
deletion seen in pattern 197.

Without the tandem cyclic alignment algorithm, the relationship between
patterns 197 and 1176 would be less clear. Our goal in comparing these patterns
is to obtain an accurate measure of the distance between them. If we used the
pattern local, text global tandem alignment algorithm (section 1.2), the results
would depend on the presentation of the text, i.e. the cyclic permutation in which
it appears in the input. If pattern 197 (the text) were presented starting just
after a deletion (at the AAT following the two nucleotide deletion for example),
then the algorithm would fail to align any characters from pattern 1176 with the
positions of the two deleted characters (which do not actually occur in pattern
197). On the other hand, if pattern 197 were presented starting as it does in
Figure 6 at GCAA, then all the deleted characters will appear in the alignment.
The alignment score will be different in these two cases.

Fig. 6. An example of aligned consensus patterns of different sizes.

Adjustment of alignment score obtained by the pattern local, text global
tandem alignment algorithm is possible, but not necessarily straightforward. As
an illustration, consider the pattern local, text global alignment (left below) and
the tandem cyclic alignment (right below) of text X and pattern Y:

X = gggtgg    Y = gggt

gggt gg       gggt gg--
gggt gg       gggt gggt

The former can be transformed into the latter by merely adding the deleted
characters and the cost for the gap. But a different situation occurs if the last
character of the text is changed to t:

X = gggtgt

gggt gt       gggt g--t
gggt gg       gggt gggt

Now the score changes not only by the cost of a gap, but also by the loss of a 
mismatched pair and the gain of a matched pair. More complicated situations are 
not difficult to construct. The tandem cyclic alignment algorithm however always 
gives the correct score without manipulation regardless of the presentation of the 
pattern or text. 

6 Conclusion 

We have defined a new alignment problem, tandem cyclic alignment, and provided
an algorithm which solves this problem in O(nm log n) time for two sequences
of length n and m, n < m, when using any alignment scoring scheme with
additive gap costs. The algorithm was used to compare tandem repeats from
the C. elegans genome in order to identify pairs of repeats with an evolutionary
relationship where the consensus pattern of one is a multiple of the consensus
pattern of the other. We showed an example of one such relationship which would
not be reliably recognized with other alignment algorithms.
Abstract. Given an input sequence of data, a motif is a repeating pattern,
possibly interspersed with “don't care” characters, and a flexible
motif could have a variable (as opposed to fixed) number of “don't care”
characters. Given a sequence of records with F fields each, an association
rule is a common set of f fields, f < F, with identical (or similar) repeating
values. The data in either case could be a sequence of characters
or sets of characters or even real values. It is well known that the number 
of motifs or association rules, say N, could potentially be exponential in 
the size of the input sequence or number of records, say n. In this paper 
we present a new algorithm to discover all flexible motifs or association 
rules in the input. A novel feature of this algorithm is that its running 
time is linear in the size of the output (ignoring polylog factors). More 
precisely, the complexity of the algorithm is O((n^5 + N) log n). This is
the first algorithm for motif discovery with a proven output sensitive 
complexity bound. The discovery algorithm works in two phases: in the 
first phase it detects a linear number of core motifs in time polynomial 
in the input size n and in the second phase it detects all the remaining 
motifs N' in O(N' log n) time. The core motifs of the first phase are also
characterized as being those of “highest specificity”: loosely speaking,
a pattern with higher specificity has fewer “don't care” characters. Some
applications (for instance the ones that require the study of those portions
of the input sequence that contribute to the non-gapped regions
of motifs) require only the core motifs. Hence for such applications, the
first phase of the algorithm suffices. However, the general problem is of 
use in motif discovery tasks in gene or protein sequences, or discovery of 
association rules from gene expression data or in data mining. 



1 Introduction 



Given a sequence of data, a "rigid" motif is a repeating pattern, possibly
interspersed with don't-care characters, that has the same length in every
occurrence in the input sequence. Pattern or motif discovery in data is widely
used as a means of "understanding" large volumes of data such as DNA or protein
sequences.

Allowing the motifs to have a variable number of gaps (or "don't care"
characters), termed flexible motifs, further increases the expressibility of
the motifs. For example, given a string s = abcdaXcdabbcd, m = a.cd is a rigid
pattern that occurs twice in the data, at positions 1 and 5 in s. In the above
example, the flexible motif would occur three times, at positions 1, 5 and 9.
At position 9 the dot character represents two gaps instead of one. The
definition of the flexible pattern used in this paper is an extension of the
generalized regular pattern, in the sense that the discovery algorithm can also
handle a sequence of real numbers.
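The difference between the rigid and the flexible reading of a.cd on this string can be checked with a short sketch. It uses ordinary regular expressions as a stand-in for the motif notation (an assumption made for illustration only; the helper name `occurrences` is hypothetical and is not the paper's discovery algorithm).

```python
import re

def occurrences(pattern, s):
    """Return the 1-based start positions at which the regex matches,
    using a lookahead so that overlapping matches are all reported."""
    return [m.start() + 1 for m in re.finditer(f"(?={pattern})", s)]

s = "abcdaXcdabbcd"
rigid = occurrences(r"a.cd", s)          # the dot stands for exactly one gap
flexible = occurrences(r"a.{1,2}cd", s)  # the dot stands for one or two gaps

print(rigid)     # [1, 5]
print(flexible)  # [1, 5, 9]
```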



The task of discovering patterns must be clearly distinguished from that of 
matching a given pattern in a database. In the latter situation we know what 
we are looking for, while in the former we do not know what is being sought. 
Typically, the higher the self-similarity in the sequence, the greater the
number of patterns or motifs in the data. Motif discovery on such data, such as
repeating DNA or protein sequences, is indeed a source of concern since these
exhibit a very high degree of self-similarity (repeating patterns). The number
of rigid motifs could potentially be exponential in the size of the input
sequence, and in the case where the input is a sequence of real numbers there
could be an uncountably infinite number of motifs (assuming two real numbers
are equal if they are within some δ > 0 of each other).



Usually, this problem of a large number of motifs is tackled by pre-processing
the input, using heuristics, to remove the repeating or self-similar portions of
the input, or by using a "statistical significance" measure. However, due to the
absence of a good understanding of the domain, there is no consensus over the
right model to use. Thus there is a trend towards model-less motif discovery in
different fields; we use the same approach to the pattern discovery problem in
this paper. There has been empirical evidence showing that the run time is
linear in the output size for rigid motifs, as well as experimental comparisons
between available implementations. However, none of the currently known
algorithms have proven output-sensitive complexity bounds, and the only known
complexity bounds are all exponential in the input size n.



In order to apply motif discovery techniques to real-life situations one has
to deal with the fact that in many applications the input is known with a
margin of error. Many amino acids in protein sequences, for instance, are easily
interchanged by evolution without loss of function. Also, the use of distance
matrices such as PAM and BLOSUM in the context of DNA sequences is common. For
example, a character a can be viewed as a or b for pattern detection purposes
(but a b cannot be viewed as an a). In all these situations it is possible to
view the input as a string of sets of characters instead of just characters.
For instance a sequence of the form baccta can be viewed as b{a,b}cct{a,b}. In
some other applications, the input is an array of real numbers, as in the case
of micro-array chips, and two distinct real numbers are deemed identical for
pattern detection purposes if they are within some given δ > 0 of each other.
Conventional motif discovery algorithms deal with these situations in an ad hoc
manner and there is a need for a uniform framework such that the same algorithm
can tackle all the scenarios described above.
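The "string of sets" view can be made concrete with a small sketch. The function name `as_set_sequence` and the dictionary of interchangeable characters are illustrative assumptions, not part of the paper.

```python
def as_set_sequence(text, variants):
    """Rewrite a string as a sequence of character sets.

    `variants` maps a character to the set of characters it may stand for;
    characters that are absent from the map stand only for themselves.
    """
    return [frozenset(variants.get(c, {c})) for c in text]

# 'a' may be read as 'a' or 'b', but 'b' is never read as 'a'.
seq = as_set_sequence("baccta", {"a": {"a", "b"}})
print(["".join(sorted(e)) for e in seq])  # ['b', 'ab', 'c', 'c', 't', 'ab']
```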

The main contributions of the paper are as follows. We present an O((n^5 + N)
log n) algorithm for the flexible motif discovery problem, where N is the size
of the output. This is the first known algorithm with a proven output-sensitive 
complexity bound even in the restricted case of rigid motifs. Moreover, we pro- 
vide a uniform framework to handle situations where the input is known with 
a margin of error. As a result our flexible-pattern discovery algorithm effort- 
lessly generalizes to sequences of sets (homologous characters) or real numbers. 
This significantly increases the domain of applicability of our motif discovery 
algorithm. 

The pattern discovery algorithm is based on identifying a unique core set
of maximal motifs. This idea was introduced earlier for the case of rigid
motifs, where it was shown that the size of the core is O(n), although the
total number of maximal motifs could be exponential in the input size. A
uniform framework that encompasses both rigid and flexible motifs has also
been presented. It is further shown there that even in the case of flexible
motifs there exists a unique core set of motifs which is of size O(n). This
result holds even under added constraints on the set of motifs, which is
central to our algorithm here.

The algorithm works in two phases. We first detect the core motifs, which
takes no more than O(n^5 log n) time for flexible motifs in the worst case. In
the second phase we detect all the other motifs, using the core motifs, in time
linear (ignoring polylog factors) in the number of the new motifs.

2 Some Preliminary Definitions 

Let s be a sequence of sets of characters from an alphabet $\Sigma$, with
'.' $\notin \Sigma$. The '.' is called a "don't care" or a dot character and
any other element is called solid. Also, $\sigma$ will refer to a singleton
character or a set of characters from $\Sigma$. For brevity of notation, a
singleton set is not enclosed in curly braces. For example, let
$\Sigma = \{A, C, G, T\}$; then $s_1 = ACTGAT$ and $s_2 = \{A,T\}CG\{T,G\}$ are
two possible sequences. The $j$-th ($1 \le j \le |s|$) element of the sequence
is given by $s[j]$. For instance, in the above example $s_2[1] = \{A,T\}$,
$s_2[2] = \{C\}$, $s_2[3] = \{G\}$ and $s_2[4] = \{T,G\}$. Also, if $x$ is a
sequence, then $|x|$ denotes the length of the sequence, and if $x$ is a set of
elements then $|x|$ denotes the cardinality of the set. Hence $|s_1| = 6$,
$|s_2| = 4$, $|s_1[1]| = 1$ and $|s_2[4]| = 2$.

Definition 1. ($e_1 \preceq e_2$) $e_1 \preceq e_2$ if and only if $e_1$ is a
"don't care" character or $e_1 \subseteq e_2$.

The flexibility of a motif is due to the variability in the number of dot char- 
acters and this is done by annotating the dot characters. 

Definition 2. (Annotated dot Character, $.^{\alpha}$) An annotated '.' character
is written as $.^{\alpha}$, where $\alpha$ is a set of non-negative integers
$\{\alpha_1, \alpha_2, \ldots, \alpha_k\}$ or an interval
$\alpha = [\alpha_l, \alpha_u]$, representing all integers between $\alpha_l$
and $\alpha_u$, including $\alpha_l$ and $\alpha_u$.

To avoid clutter, the annotation superscript $\alpha$ will be an integer interval.

Definition 3. (Rigid, Flexible String) Given a string m, if at least one dot
element '.' is annotated, m is called a flexible string; otherwise m is called
rigid.



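Definition 4. (Realization) Let p be a flexible string. A rigid string p' is a
realization of p if each annotated dot element $.^{\alpha}$ is replaced by $l$
dot elements, where $l \in \alpha$.

For example, if $p = a.^{\alpha}b.^{\beta}cde$ with intervals $\alpha$ and
$\beta$ that both contain 3, then p' = a...b...cde is a realization of p; the
other realizations of p differ from p' only in the number of dot elements
between the solid characters.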

Definition 5. (p Occurs at l) A rigid string p occurs at position l on s if
$p[j] \preceq s[l + j - 1]$ holds for $1 \le j \le |p|$. A flexible string p
occurs at position l in s if there exists a realization p' of p that occurs at l.

If p is flexible then p could possibly occur multiple times at a location on a
string s. For example, if s = axbcbc, a flexible string p can occur twice at
position 1, with two different realizations of p matching different characters
of axbcbc. This multiplicity of occurrence increases the complexity of the
algorithm over that of rigid motifs in the discovery process, as discussed
in the next section.
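Definitions 1, 4 and 5 can be exercised with a small sketch. The token representation (a solid element as a string or set, a plain dot as ".", an annotated dot as a tuple `(lo, hi)`) and the pattern `a.^[1,3]bc` are assumptions chosen for illustration; the pattern of the example above is not legible in the source, so this one is only a stand-in that also happens to occur twice at position 1 of axbcbc.

```python
from itertools import product

DOT = "."

def precedes(e1, e2):
    """Definition 1: e1 is a don't care, or e1 is contained in e2."""
    return e1 == DOT or frozenset(e1) <= frozenset(e2)

def realizations(p):
    """Definition 4: expand every annotated dot (lo, hi) into l plain dots,
    for each l in the interval, and yield the resulting rigid strings."""
    choices = [range(t[0], t[1] + 1) if isinstance(t, tuple) else [None] for t in p]
    for combo in product(*choices):
        rigid = []
        for tok, k in zip(p, combo):
            rigid.extend([DOT] * k if k is not None else [tok])
        yield rigid

def occurs_at(rigid, s, l):
    """Definition 5 for a rigid string; l is a 1-based position in s."""
    if l + len(rigid) - 1 > len(s):
        return False
    return all(precedes(rigid[j], s[l - 1 + j]) for j in range(len(rigid)))

def occurrences_at(p, s, l):
    """All realizations of the flexible string p that occur at position l."""
    return [r for r in realizations(p) if occurs_at(r, s, l)]

p = ["a", (1, 3), "b", "c"]              # hypothetical pattern a.^[1,3]bc
hits = occurrences_at(p, "axbcbc", 1)
print(["".join(r) for r in hits])        # ['a.bc', 'a...bc']

s2 = [{"A", "T"}, {"C"}, {"G"}, {"T", "G"}]   # the sequence s2 of sets
print(occurs_at(["A", DOT, "G"], s2, 1))      # True
```

The same position can therefore contribute more than one occurrence, which is the multiplicity the paragraph above refers to.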

Definition 6. (Motif m, Location List $\mathcal{L}_m$) Given a string s on
alphabet $\Sigma$ and a positive integer k, $k \le |s|$, a string (flexible or
rigid) m is a motif with location list $\mathcal{L}_m = (l_1, l_2, \ldots, l_p)$
if $m[1] \ne$ '.', $m[|m|] \ne$ '.', m occurs at each $l \in \mathcal{L}_m$,
there exists no $l' \notin \mathcal{L}_m$ such that m occurs at $l'$, and
$p \ge k$.



Definition 7. (Realization of a Motif m) Given a motif m on an input string
s with a location list $\mathcal{L}_m$, and m' a realization of the string m,
m' is a realization of the motif m if and only if there exists some
$k \in \mathcal{L}_m$ such that m' occurs at k in s.

Notice that because of our notation of annotating a dot character with an
integer interval (instead of a set of integers), not every realization of the
flexible motif occurs in the input string. In the remaining discussion we will
use this stricter definition of motif realization (Definition 7) unless
otherwise specified.

Definition 8. ($m_1 \preceq m_2$) Given two motifs $m_1$ and $m_2$ with
$|m_1| \le |m_2|$, $m_1 \preceq m_2$ holds if for every realization $m_1'$ of
motif $m_1$ there exists a realization $m_2'$ of motif $m_2$ such that
$m_1'[j] \preceq m_2'[j]$, $1 \le j \le |m_1|$.

Footnote: The first and last characters of the motif are solid characters; if
"don't care" characters are allowed at the ends, the motifs can be made
arbitrarily long in size without conveying any extra information.

For example, let $m_1$ = AB..E, $m_2$ = AK..E and $m_3$ = ABC.E.G. Then
$m_1 \preceq m_3$, and $m_2 \not\preceq m_3$. The following lemma is
straightforward to verify.

Definition 9. ($m_1 = m_2$) Given two motifs $m_1$ and $m_2$ with
$|m_1| = |m_2|$, $m_1 = m_2$ holds if for every realization $m_1'$ of motif
$m_1$ there exists a realization $m_2'$ of motif $m_2$ such that
$m_1'[j] = m_2'[j]$, $1 \le j \le |m_1|$.



Lemma 1. If $m_1 \preceq m_2$, then $\mathcal{L}_{m_1} \supseteq \mathcal{L}_{m_2}$.
If $m_1 \preceq m_2$ and $m_2 \preceq m_3$, then $m_1 \preceq m_3$.



Definition 10. (Sub-motifs of Motif m) Given a motif m, let $m[j_1], m[j_2],
\ldots, m[j_l]$ be the $l$ solid elements in the motif m. Then the sub-motifs
of m are given as follows: for every $j_i$, $j_k$, the sub-motif is obtained by
dropping all the elements before (to the left of) $j_i$ and all elements after
(to the right of) $j_k$ in m.



A very natural notion of maximality, defined earlier, is given below for
completeness.



Definition 11. (Maximal Motif) Let $p_1, p_2, \ldots, p_k$ be the motifs in a
sequence s. Define $p_i[j]$ to be '.' if $j > |p_i|$. A motif $p_i$ is maximal
in composition if and only if there exists no $p_l$, $l \ne i$, with
$\mathcal{L}_{p_i} = \mathcal{L}_{p_l}$ and $p_i \preceq p_l$. A motif $p_i$,
maximal in composition, is also maximal in length if and only if there exists
no motif $p_j$, $j \ne i$, such that $p_i$ is a sub-motif of $p_j$ and
$|\mathcal{L}_{p_i}| = |\mathcal{L}_{p_j}|$. A maximal motif is maximal both in
composition and in length.



3 Algorithm Prerequisites 

It is quite clear that the number of maximal flexible motifs could be
exponential in the size of the input s. It has been shown that there is a small
basis set of motifs of size O(n) for every input of size n. The remaining
motifs can be computed from this set of motifs. We recall the definition and
the statement of the theorem for the sake of completeness here.

We now define the notion of redundancy and the basis set. Informally speak- 
ing, we call a motif m redundant if m and its location list Cm can be deduced 
from the other motifs without studying the input string s. We introduce such 
a notion below and the section on “generating operation” describes how the 
redundant motifs and the location lists can be computed from the irredundant 
motifs. 

Definition 12. (Redundant, Irredundant Motif) A maximal motif m, with location
list $\mathcal{L}_m$, is redundant if there exist maximal motifs $m_i$,
$1 \le i \le p$, $p \ge 1$, such that
$\mathcal{L}_m = \mathcal{L}_{m_1} \cup \mathcal{L}_{m_2} \cup \ldots \cup \mathcal{L}_{m_p}$
and $m \preceq m_i$ for all i. A maximal motif that is not redundant is called
an irredundant motif.




136 Laxmi Parida, Isidore Rigoutsos, and Dan Platt 



Notice that for a rigid motif p > 1 (p in Definition 12), since each location
list corresponds to exactly one motif, whereas for a flexible motif p could
have the value 1. For example, let s = axfygsbapgrftb. Then
$m_1 = a.^{[1,3]}f.^{[1,3]}b$, $m_2 = a.^{[1,3]}g.^{[1,3]}b$ and
$m_3 = a.....b$, with
$\mathcal{L}_{m_1} = \mathcal{L}_{m_2} = \mathcal{L}_{m_3} = \{1, 8\}$. But
$m_3$ is redundant since $m_3 \preceq m_1, m_2$. Also $m_1 \not\preceq m_2$ and
$m_2 \not\preceq m_1$, hence both $m_1$ and $m_2$ are irredundant although
$\mathcal{L}_{m_1} = \mathcal{L}_{m_2}$. This also illustrates the case where
one location list corresponds to two distinct flexible motifs (motifs $m_1$ and
$m_2$ are distinct if $m_1 = m_2$ does not hold).
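The location lists in this example can be recomputed with a short sketch. The motif encoding (solid characters as strings, gaps as `(lo, hi)` tuples) is an assumption, and so are the gap intervals [1,3]: they are the tightest intervals that make both motifs occur at positions 1 and 8 of s, since the annotations are not legible in the source.

```python
def occurs(motif, s, pos):
    """True if the motif matches s starting at 0-based index pos.
    A tuple (lo, hi) stands for a run of lo..hi don't care characters."""
    if not motif:
        return True
    head, rest = motif[0], motif[1:]
    if isinstance(head, tuple):
        lo, hi = head
        return any(occurs(rest, s, pos + k) for k in range(lo, hi + 1))
    return pos < len(s) and s[pos] == head and occurs(rest, s, pos + 1)

def location_list(motif, s):
    """1-based positions of s at which the motif occurs."""
    return {i + 1 for i in range(len(s)) if occurs(motif, s, i)}

s = "axfygsbapgrftb"
m1 = ["a", (1, 3), "f", (1, 3), "b"]
m2 = ["a", (1, 3), "g", (1, 3), "b"]
m3 = ["a", (5, 5), "b"]              # a, five dots, b

L1, L2, L3 = (location_list(m, s) for m in (m1, m2, m3))
print(L1, L2, L3)                    # {8, 1} or {1, 8} for each list
print(L3 == L1 | L2)                 # True: the union condition of Definition 12
```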

Generating Operations. The redundant motifs need to be generated from the 
irredundant ones, if required. We define the following generating operations. 
The binary OR operator $\oplus$ is used in the algorithm in the process of motif
detection and the AND operator $\otimes$ in the generation of redundant motifs
from the basis.

Given an input sequence s, let m, $m_1$ and $m_2$ be motifs.

Binary AND operator, $m_1 \otimes m_2$: $m = m_1 \otimes m_2$, where m is such
that $m \preceq m_1, m_2$ and there exists no motif $m' \ne m$ with
$m \preceq m' \preceq m_1, m_2$. For example, for $m_1 = A.D.^{[2,4]}G$ and a
suitable $m_2$, $m = m_1 \otimes m_2 = A...^{[2,4]}G$.

Binary OR operator, $m_1 \oplus m_2$: $m = m_1 \oplus m_2$, where m is such that
$m_1, m_2 \preceq m$ and there exists no motif $m' \ne m$ with
$m_1, m_2 \preceq m' \preceq m$. For example, if $m_1$ = A..D..G and
$m_2$ = AB...FG, then $m = m_1 \oplus m_2$ = AB.D.FG.
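For rigid motifs written as plain strings, both operators reduce to a position-by-position comparison. The sketch below covers only that restricted case (rigid motifs over single characters, aligned at their first character); the function names are hypothetical and annotated dots are not handled.

```python
def motif_or(m1, m2):
    """OR of two rigid motifs of equal length: keep every solid character."""
    if len(m1) != len(m2):
        raise ValueError("this sketch only merges motifs of equal length")
    out = []
    for a, b in zip(m1, m2):
        if a != "." and b != "." and a != b:
            raise ValueError("conflicting solid characters")
        out.append(a if a != "." else b)
    return "".join(out)

def motif_and(m1, m2):
    """AND of two rigid motifs: keep only the positions on which both agree,
    then trim trailing dots so that the result ends in a solid character."""
    common = "".join(a if a == b else "." for a, b in zip(m1, m2))
    return common.rstrip(".")

print(motif_or("A..D..G", "AB...FG"))   # AB.D.FG
print(motif_and("abc.d", "ab..d"))      # ab..d
print(motif_and("add.d", "ad..e"))      # ad
```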

Definition 13. (Basis) Given an input sequence s, let $\mathcal{M}$ be the set
of all maximal motifs on s. A set of maximal motifs $\mathcal{B}$ is called a
basis of $\mathcal{M}$ iff the following hold:

1. for each $m \in \mathcal{B}$, m is irredundant with respect to
$\mathcal{B} - \{m\}$, and,

2. let $G(\mathcal{X})$ be the set of all the redundant maximal motifs generated
by the set of motifs $\mathcal{X}$; then $\mathcal{M} = G(\mathcal{B})$.

The following theorem has been proved earlier and we give only the statement
here.

Theorem 1. Let s be a string with n = |s| and let $\mathcal{B}$ be a basis or a
set of irredundant flexible motifs. Then $\mathcal{B}$ is unique and
$|\mathcal{B}| = O(n)$.

We give a useful corollary to this theorem below. 

Corollary 1. Given an input sequence of length n, let M be a set of motifs, not
necessarily maximal, with the following properties:

1. For each $p, q \in M$, $p \ne q$, let p' be a suffix string of p; then
$q \not\preceq p'$, unless $|\mathcal{L}_p| \ne |\mathcal{L}_q|$.

2. There does not exist $p \in M$ such that
$\mathcal{L}_p = \bigcup_i \mathcal{L}_{q_i}$ and $p \preceq q_i$ for all i.

Then |M| = O(n).

For example, if $m_1$ = ABC with $\mathcal{L}_{m_1} = \{1, 5, 7\}$ and
$m_1 \in M$, and $m_2$ = AB with $\mathcal{L}_{m_2} = \{1, 5, 7\}$, then
$m_2 \notin M$. However, if $\mathcal{L}_{m_2} = \{1, 5, 7, 12\}$
($\mathcal{L}_{m_1} \subset \mathcal{L}_{m_2}$), then $m_2$ could belong to M.



This result is used in the algorithm to bound the number of non-maximal motifs
at each iteration of the algorithm. We next describe two problems on sets: the
Set Intersection Problem (SIP) and the Set Union Problem (SUP), which are
used in the pattern discovery algorithm in the next section.

The Set Intersection Problem, SIP(n, m, l). Given n sets $S_1, S_2, \ldots, S_n$
on m elements, find all the N distinct sets of the form
$S_{i_1} \cap S_{i_2} \cap \ldots \cap S_{i_p}$ with $p \ge l$. Notice that it is
possible that $N = O(2^n)$. We give an O(N log n + mn) algorithm, described in
Appendix A, to obtain all the intersection sets.

The Set Union Problem, SUP(n, m). Given n sets $S_1, S_2, \ldots, S_n$ on m
elements each, find all the sets $S_i$ such that
$S_i = S_{i_1} \cup S_{i_2} \cup \ldots \cup S_{i_p}$, $i \ne i_j$,
$1 \le j \le p$. We present an algorithm in Appendix B to solve this problem in
time $O(n^2 m)$.



4 The Pattern Discovery Algorithm 

The algorithm can be described as follows. It begins by computing one-character
patterns and then successively grows them by concatenating them with other
patterns until they cannot be grown any further. However, the drawback of this
simplistic approach is that the number of patterns at each step grows very
rapidly. We solve the problem by first computing only the basis set. This is
done by trimming the number of growing patterns at each step and using
Theorem 1 to bound their number by O(n). Thus in time O(n^5 log n) the basis
can be detected. In the next step the remaining motifs are computed from the
basis in time "proportional" to their number.

Input Parameters. The input parameters are: (1) the string s, (2) the minimum
number of times a pattern must appear, k, and (3) the flexibility of the dot
characters, $\Delta$. The flexibility property has the following
interpretation: given a flexibility of $\Delta$, we accept dot character
annotations of the form $[\alpha_1, \alpha_2]$ where
$(\alpha_2 - \alpha_1) \le \Delta$. For the rest of the algorithm we will
assume that the alphabet size $|\Sigma| = O(1)$.

We use the following notation, given a motif m (not necessarily maximal):
F(m) denotes the first element of m and E(m) denotes the last element of m.
Note that F(m) $\ne$ '.' and E(m) $\ne$ '.'. Further,
$\mathcal{L}'_m = \{(i, j) \mid m'$ is a realization of m that occurs at i and
ends at j$\}$. Note that $\mathcal{L}_m = \{i \mid (i, -) \in \mathcal{L}'_m\}$.

4.1 Computing the Basis 

This proceeds in two major phases as follows:

Step 1: Pattern initialization phase
Repeat Until Done:
Step 2: Pattern concatenation phase

Footnote: Recall that each element of s is a character or a set of characters
from the alphabet $\Sigma$, or even a real number. In the case where the input
is a sequence of real numbers, it has been shown that this problem can be
mapped onto an instance of a pattern discovery problem on strings of sets of
characters. Thus the treatment discussed in this paper also extends to flexible
patterns on real number sequences.

Pattern Initialization Phase. This proceeds in the following steps.

Step 1.1. For every $\sigma \in \Sigma$, construct $m = \sigma$ and
$\mathcal{L}'_m = \{(i, i) \mid s[i] = \sigma\}$, with F(m) = E(m) = $\sigma$.
This takes O(n) time.

Step 1.2. This step is required only while dealing with strings on sets of
characters. For example, if $m_1 = \{b, c, d\}$ and $m_2 = \{b, c, e\}$, we need
to check if there exists $m = \{b, c\}$. Note that
$\mathcal{L}_m = \mathcal{L}_{m_1} \cup \mathcal{L}_{m_2}$, while the characters
in m are the intersection of the sets of characters in $m_1$ and $m_2$. We solve
this using the Set Intersection Problem SIP($|\Sigma|$, k, 2). Assuming
$|\Sigma| = O(1)$, this takes time polynomial in n.

Step 1.3. Let $m = m_1 .^d m_2$ denote the string obtained by concatenating the
element $m_1$ followed by d '.' characters followed by the element $m_2$.
For d = 0 ... n, construct the motif $m = m_i .^d m_j$ and
$\mathcal{L}'_m = \{(x, x+d+1) \mid (x, x) \in \mathcal{L}'_{m_i},
(x+d+1, x+d+1) \in \mathcal{L}'_{m_j}\}$, with $F(m) = F(m_i)$ and
$E(m) = E(m_j)$.
This takes time polynomial in n and the number of motifs at this step is O(n).

Step 1.4. In the case of flexible motifs, construct the following flexible
motifs. Construct sets of motifs P such that for all $m_i, m_j \in P$,
$F(m_i) = F(m_j)$ and $E(m_i) = E(m_j)$. For each such set P, for
$l = 0 \ldots n - \Delta$, construct a flexible motif m together with its list
$\mathcal{L}'_m$ from the motifs of P, with $F(m) = F(m_i)$ and
$E(m) = E(m_j)$.
This takes time polynomial in n and the number of motifs at this step is O(n).

Step 1.5. This is the pruning step. We do two kinds of pruning: (1) where all
suffix motifs are removed and (2) where all the "redundant" motifs are removed.
For the former, we offset every location list to zero and check for identity of
location lists. The latter is described below. Let $\mathbf{L}$ denote all the
location lists of the motifs constructed in Step 1.3. Using the Set Union
Problem SUP($|\mathbf{L}|$, n), remove all the motifs whose location list is
exactly the union of some other location lists. If
$\mathcal{L}_m = \bigcup_i \mathcal{L}_{m_i}$, remove m and update each $m_i$ as
$m_i = m_i \oplus m$ and, if $|m| > |m_i|$, $E(m_i) = E(m)$.
For example, if $m_1$ = a.b, $m_2$ = a..c and m = a...d with
$\mathcal{L}_m = \mathcal{L}_{m_1} \cup \mathcal{L}_{m_2}$, then $m_1$ is
updated as $m_1$ = a.b.d and $m_2$ is updated as $m_2$ = a..cd.
This step takes time polynomial in n and the number of motifs at this step
is O(n).
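Steps 1.1 and 1.3 can be sketched for an input over plain characters as follows. This is an illustration under stated assumptions, not the paper's implementation: the function name `initial_motifs` is hypothetical, each list holds (start, end) pairs with 1-based positions, the end of a pattern with d dots is taken to be x + d + 1, and the pruning of Step 1.5 is replaced by a simple frequency threshold k.

```python
from collections import defaultdict

def initial_motifs(s, k):
    """Return {motif: set of (start, end)} for the single characters of s
    (Step 1.1) and for every pattern 'a', d dots, 'b' (Step 1.3),
    keeping only patterns that occur at least k times."""
    n = len(s)
    extents = defaultdict(set)
    for i, c in enumerate(s, start=1):
        extents[c].add((i, i))                       # Step 1.1
    for x in range(1, n + 1):
        for y in range(x + 1, n + 1):                # Step 1.3, d = y - x - 1
            motif = s[x - 1] + "." * (y - x - 1) + s[y - 1]
            extents[motif].add((x, y))
    return {m: ext for m, ext in extents.items() if len(ext) >= k}

motifs = initial_motifs("abcdaXcdabbcd", k=2)
print(sorted(motifs["a.c"]))   # [(1, 3), (5, 7)]
print(sorted(motifs["cd"]))    # [(3, 4), (7, 8), (12, 13)]
```

Enumerating all pairs of positions takes quadratic time, which is acceptable for a sketch but says nothing about the bounds claimed for the steps above.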

Pattern Concatenation Phase. This proceeds in the following steps. 

Step 2.1. Consider every pair of motifs $m_1$ and $m_2$ with
$E(m_1) \preceq F(m_2)$ or $F(m_2) \preceq E(m_1)$. Let $l = |m_1|$. Define
$m = m_1 + m_2$ as follows.

If $E(m_1) \preceq F(m_2)$ then

$$m[i] = \begin{cases} m_1[i] & i < l \\ m_2[i - l + 1] & i \ge l \end{cases}$$

If $F(m_2) \preceq E(m_1)$ then

$$m[i] = \begin{cases} m_1[i] & i \le l \\ m_2[i - l + 1] & i > l \end{cases}$$

Further, $\mathcal{L}'_m = \{(i, j) \mid (i, k) \in \mathcal{L}'_{m_1}$ and
$(k, j) \in \mathcal{L}'_{m_2}\}$, $F(m) = F(m_1)$ and $E(m) = E(m_2)$.

For a rigid motif m, $|\mathcal{L}'_m| \le n$, and for a flexible motif m,
$|\mathcal{L}'_m| \le n^2$; hence this step takes time polynomial in n, with
separate bounds for rigid motifs and for flexible motifs.

Step 2.2. This is the pruning phase and is the same as Step 1.5. This takes
time polynomial in n. Using Corollary 1 (prune (1) takes care of the first and
prune (2) takes care of the second condition in the Corollary), at the end of
this step there are O(n) patterns.

The number of iterations is log J, where J is the length of the longest motif
in s. Since J is bounded by n, the basis is detected in time polynomial in n
with an additional log n factor; in the case of flexible motifs the bound is
O(n^5 log n).
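The join of the two (start, end) lists in Step 2.1 can be sketched as below for rigid motifs over plain characters, where the last character of the first motif must equal the first character of the second. The function name `concatenate` and the dictionary index are illustrative choices, not taken from the paper.

```python
from collections import defaultdict

def concatenate(m1, ext1, m2, ext2):
    """Overlap the last element of m1 with the first element of m2 and join
    their (start, end) lists: (i, j) is kept when some k gives
    (i, k) in ext1 and (k, j) in ext2."""
    if m1[-1] != m2[0]:
        raise ValueError("E(m1) and F(m2) do not match")
    ends_by_start = defaultdict(set)
    for k, j in ext2:
        ends_by_start[k].add(j)
    joined = {(i, j) for i, k in ext1 for j in ends_by_start.get(k, ())}
    return m1 + m2[1:], joined

# Lists for "a.c" and "cd" on s = abcdaXcdabbcd (1-based positions).
m, ext = concatenate("a.c", {(1, 3), (5, 7)},
                     "cd", {(3, 4), (7, 8), (12, 13)})
print(m, sorted(ext))   # a.cd [(1, 4), (5, 8)]
```

Because each round overlaps two already grown motifs, pattern lengths can roughly double per iteration, which is consistent with the log J bound on the number of iterations.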



4.2 Computing the Redundant Maximal Patterns 

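A redundant maximal motif m is of the form
$m_1 \otimes m_2 \otimes \ldots \otimes m_p$ for some p, and
$\mathcal{L}_m = \mathcal{L}_{m_1} \cup \mathcal{L}_{m_2} \cup \ldots \cup \mathcal{L}_{m_p}$.
We give an example below to show that a straightforward approach of combining
(using the operator $\otimes$) compatible motifs does not give the desired time
complexity.

Example 1. Let $m_1$ = ab...d, $m_2$ = a...cd, $m_3$ = a.e..d, $m_4$ = a..f.d
with $\mathcal{L}_{m_1} = \{10, 20\}$, $\mathcal{L}_{m_2} = \{30, 40\}$,
$\mathcal{L}_{m_3} = \{20, 40\}$, $\mathcal{L}_{m_4} = \{10, 30\}$. Then
$\mathcal{L}_{m_5} = \mathcal{L}_{m_1} \cup \mathcal{L}_{m_2} \cup \mathcal{L}_{m_3} \cup \mathcal{L}_{m_4}$,
$\mathcal{L}_{m_6} = \mathcal{L}_{m_2} \cup \mathcal{L}_{m_3} \cup \mathcal{L}_{m_4}$,
$\mathcal{L}_{m_7} = \mathcal{L}_{m_1} \cup \mathcal{L}_{m_3} \cup \mathcal{L}_{m_4}$,
$\mathcal{L}_{m_8} = \mathcal{L}_{m_1} \cup \mathcal{L}_{m_2} \cup \mathcal{L}_{m_4}$ and
$\mathcal{L}_{m_9} = \mathcal{L}_{m_1} \cup \mathcal{L}_{m_2} \cup \mathcal{L}_{m_3}$
are such that $m_5 = m_6 = m_7 = m_8 = m_9$ = a....d. In other words, the motif
$m_5$ is constructed at least four more times than required.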

We give below an output-sensitive algorithm to compute all the redundant mo- 
tifs. 

Given $\mathcal{B}$, the set of all the irredundant motifs, construct
$\mathcal{P}$, a set of subsets of $\mathcal{B}$, as follows:
$P \in \mathcal{P}$ if for each pair of motifs $m_i, m_j \in P$, without loss
of generality, $F(m_i) \preceq F(m_j)$ and $m_i \ne m_j$, and P is the largest
such set. For each $P \in \mathcal{P}$, we construct an instance of the Set
Intersection Problem SIP as follows. We claim that the union of the solutions
to each of the SIP instances gives all the maximal redundant motifs in time
O(N log n). We illustrate this through two examples and omit the formal
arguments due to space constraints. Recall that N is the number of maximal
motifs and n is the length of the input sequence.

For each $P \in \mathcal{P}$ do the following. Let $l = \max_{m \in P} |m|$.
Construct $\bar{m}[i]$, $2 \le i \le l$, as follows:
$\bar{m}[i] = \{\sigma \ne$ '.' $\mid \sigma \preceq p[i], p \in P\}$. Note that
it is possible that $\bar{m}[i] = \{\}$ for some i. Now construct an instance
of SIP(N', M, 2) as follows. The M elements on which the sets are built are a
subset of the basis set and M = |P|. The N' sets are constructed as follows:
$S^j_e = \{m_i \mid m_i[j] = e\}$ for all possible values of j and e with
$|S^j_e| \ge 2$. Assuming that $|\Sigma| = O(1)$, the number of such sets is
N' = O(n). Recall that n is the length of the input string s whose motifs are
being discovered. Each $S^j_e$ with $|S^j_e| \ge 2$ corresponds to a maximal
redundant motif. Although the same location lists may give distinct flexible
motifs, this does not cause any problems since we use the solid characters of
the motifs in P.

Footnote: Two motifs $m_1$ and $m_2$ are compatible, without loss of generality,
if $m_1[1] \preceq m_2[1]$ and there is an i such that $m_1[i] \ne$ '.',
$m_2[i] \ne$ '.' and $m_1[i] \preceq m_2[i]$,
$1 < i \le \min(|m_1|, |m_2|)$.

As an illustration we first show an example involving rigid motifs. 

Example 2. Let $m_1$ = abc.d, $m_2$ = abe, $m_3$ = add.d, $m_4$ = ad..e and
$m_5$ = ab..d. Here l = 5 and $S^2_b = \{m_1, m_2, m_5\}$,
$S^2_d = \{m_3, m_4\}$, $S^5_d = \{m_1, m_3, m_5\}$.





        1   2   3   4   5
m1      a   b   c   .   d
m2      a   b   e
m3      a   d   d   .   d
m4      a   d   .   .   e
m5      a   b   .   .   d



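Each of the sets corresponds to a maximal redundant motif. For example, $S^2_b$
gives the maximal redundant motif $m_1 \otimes m_2 \otimes m_5$ = ab with
location list
$\mathcal{L}_{m_1} \cup \mathcal{L}_{m_2} \cup \mathcal{L}_{m_5}$; $S^2_d$ gives
$m_3 \otimes m_4$ = ad with location list
$\mathcal{L}_{m_3} \cup \mathcal{L}_{m_4}$; and $S^5_d$ gives
$m_1 \otimes m_3 \otimes m_5$ = a...d with location list
$\mathcal{L}_{m_1} \cup \mathcal{L}_{m_3} \cup \mathcal{L}_{m_5}$. The results
from SIP give the unique intersection set $\{m_1, m_5\}$, and this corresponds
to the motif $m = m_1 \otimes m_5$ = ab..d with
$\mathcal{L}_m = \mathcal{L}_{m_1} \cup \mathcal{L}_{m_5}$.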
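The sets of Example 2 can be recomputed mechanically. The sketch below groups the rigid motifs by the solid character they carry at each position from 2 onwards, then intersects the resulting groups pairwise. Keeping only intersections with at least two motifs is an assumption made here to match the single set reported in the example, and the function name `position_sets` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

motifs = {"m1": "abc.d", "m2": "abe", "m3": "add.d", "m4": "ad..e", "m5": "ab..d"}

def position_sets(motifs):
    """Group motif names by (position j, solid character e), for j >= 2,
    and keep only the groups that contain at least two motifs."""
    groups = defaultdict(set)
    for name, m in motifs.items():
        for j, e in enumerate(m, start=1):
            if j >= 2 and e != ".":
                groups[(j, e)].add(name)
    return {key: frozenset(g) for key, g in groups.items() if len(g) >= 2}

sets = position_sets(motifs)
pairwise = {a & b for a, b in combinations(sets.values(), 2) if len(a & b) >= 2}

for key in sorted(sets):
    print(key, sorted(sets[key]))
print([sorted(p) for p in pairwise])   # [['m1', 'm5']]
```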

Consider an example using flexible motifs. 

Example 3. Let $m_1 = ab.^{\alpha_1}ec$, $m_2 = ab.^{\alpha_2}bc$,
$m_3 = ab.^{\alpha_3}be$ and $m_4 = a.^{\alpha_4}cb$, where each $\alpha_i$ is
an integer interval.





        1   2            3            4   5
m1      a   b            .^{α1}       e   c
m2      a   b            .^{α2}       b   c
m3      a   b            .^{α3}       b   e
m4      a   .^{α4}       c            b





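Here l = 5. The different sets are $S^2_b = \{m_1, m_2, m_3\}$,
$S^4_b = \{m_2, m_3, m_4\}$ and $S^5_c = \{m_1, m_2\}$.

Each of the sets corresponds to a maximal redundant motif. $S^2_b$ gives
$m_1 \otimes m_2 \otimes m_3$ = ab, with location list
$\mathcal{L}_{m_1} \cup \mathcal{L}_{m_2} \cup \mathcal{L}_{m_3}$; $S^4_b$ gives
$m_2 \otimes m_3 \otimes m_4 = a.^{[1,3]}.^{[1,2]}b$, in which the two adjacent
annotated dots merge into a single annotated dot, with location list
$\mathcal{L}_{m_2} \cup \mathcal{L}_{m_3} \cup \mathcal{L}_{m_4}$; and $S^5_c$
gives $m_1 \otimes m_2 = ab.^{[1,3]}.^{[1,2]}c$, again with the adjacent
annotated dots merged, with location list
$\mathcal{L}_{m_1} \cup \mathcal{L}_{m_2}$. The intersection results from SIP
give $\{m_2, m_3\}$ with $m = m_2 \otimes m_3$, a motif of the form
$ab.^{\alpha}b$, and location list $\mathcal{L}_{m_2} \cup \mathcal{L}_{m_3}$.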

® Since 
A The Set Intersection Problem, SIP(n, m, l)

Given n sets $S_1, S_2, \ldots, S_n$ on m elements, find all the N distinct
sets of the form $S_{i_1} \cap S_{i_2} \cap \ldots \cap S_{i_p}$ with
$p \ge l$. We give an O(N log n + mn) algorithm below to obtain all the
intersection sets.

Let the elements be numbered 1 ... m. Construct a binary tree T using the
subroutine CREATE-NODE shown below.

Assume a function CREATE-SET(S) which creates S, a subset of
$S_1, S_2, \ldots, S_n$, in an appropriate data structure $\mathcal{D}$ (say a
tree). A query of the form "is the subset S in $\mathcal{D}$?"
(DOES-EXIST(S)) returns True/False in time O(log n).

Node CREATE-NODE(S, h, l)

{

(1) New(this-node)

(2) CREATE-SET(S)

(3) Let S' = {S_i ∈ S | h ∈ S_i}

(4) if ((|S'| ≥ l) and not DOES-EXIST(S') and (h ≥ 2))

(5) Left-child = CREATE-NODE(S', h-1, l)

(6) Right-child = CREATE-NODE(S, h-1, l)

(7) return (this-node)

}

For l = 2, there is exactly one node in the tree T. For l > 2, the initial call
is CREATE-NODE($\{S_1, S_2, \ldots, S_n\}$, m, l). Clearly, all the unique
intersection sets, which are N in number, are at the leaf nodes of this tree T.
Also, the number of internal nodes cannot exceed the number of leaf nodes, N.
Thus the total number of nodes of T is O(N). The cost of the query at each node
is O(log n) (line (4) of CREATE-NODE). The size of the input data is O(nm) and
each data item is read exactly once in the algorithm (line (3) of CREATE-NODE).
Hence the algorithm takes O(N log n + nm) time.
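For experimenting with small instances, the intersection sets can also be obtained by a plain closure computation. The sketch below is not the tree construction above and is not output-sensitive: it repeatedly intersects known sets with the input sets until nothing new appears. Ignoring empty intersections is an assumption, and the function name `all_intersections` is hypothetical.

```python
def all_intersections(sets, l=2):
    """Return the distinct non-empty sets that are the intersection of at
    least l of the given sets (sets at different indices count separately)."""
    family = [frozenset(s) for s in sets]
    found = set(family)          # intersections of one or more input sets
    frontier = set(family)
    while frontier:
        new = set()
        for a in frontier:
            for b in family:
                c = a & b
                if c and c not in found:
                    new.add(c)
        found |= new
        frontier = new
    # c is an intersection of at least l sets exactly when at least l input
    # sets contain it, because the sets containing c intersect to c itself.
    return {c for c in found if sum(1 for s in family if c <= s) >= l}

result = all_intersections([{1, 2, 3}, {2, 3, 4}, {3, 4, 5}], l=2)
print(sorted(sorted(c) for c in result))   # [[2, 3], [3], [3, 4]]
```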

B The Set Union Problem, SUP(n,m) 

Given n sets $S_1, S_2, \ldots, S_n$ on m elements each, find all the sets
$S_i$ such that $S_i = S_{i_1} \cup S_{i_2} \cup \ldots \cup S_{i_p}$,
$i \ne i_j$, $1 \le j \le p$.

This is a very straightforward algorithm (it contributes an additive term
to the overall complexity of the pattern detection algorithm). For each set
$S_i$, we first obtain the sets $S_j$, $j \ne i$, $j = 1 \ldots n$, such that
$S_j \subset S_i$. This can be done in O(nm) time (for each i). Next, we check
if $\bigcup_j S_j = S_i$. Again this can be done in O(nm) time. Hence the total
time taken is $O(n^2 m)$.
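The procedure just described translates almost line by line into code. In the sketch below the containment test is taken to be proper containment and empty sets are never reported; both are assumptions, since the source does not settle them. The function name `union_of_others` is hypothetical.

```python
def union_of_others(sets):
    """Return the indices i such that sets[i] equals the union of the other
    sets that are properly contained in it."""
    family = [frozenset(s) for s in sets]
    result = []
    for i, target in enumerate(family):
        covered = set()
        for j, other in enumerate(family):
            if j != i and other < target:      # proper subset of the target
                covered |= other
        if target and covered == target:
            result.append(i)
    return result

print(union_of_others([{1, 2}, {3}, {1, 2, 3}, {1, 4}]))   # [2]
```

Each of the n targets is compared with the other n - 1 sets, and each comparison touches at most m elements, which matches the $n^2 m$ count of the analysis above.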
Abstract. The episode matching problem is considered and a method
for preprocessing the text is presented. Once the text is preprocessed, an
episode substring can be found in time linear in the length of the pattern
(episode).



A subsequence of a string T is any string obtainable by removing zero or more
symbols from T. Given two strings, pattern S and text T, an episode substring
is a minimal substring $\alpha$ of T that contains S as a subsequence. Minimal
means that no proper substring of $\alpha$ contains S as a subsequence. The
episode matching problem is to find all episode substrings. All strings in this
paper are considered on an alphabet $\Sigma$ of size $\sigma$.

The problem arises in analyzing sequences of events, e.g. alarms from a
telecommunication network, actions from a user, or records from a WWW-server
log file. Knowledge of frequent episode substrings can then be used to describe
or predict the sequence. The first mention of the problem probably comes from
Mannila, Toivonen and Verkamo. Their solution requires O(nmk) time, where
m is the length of S, n is the length of T, and k is the number of episodes (in
our case k = 1). Das et al. proposed several algorithms, including one running
in O(nm) time and others whose bounds apply when additional space is limited to
O(s). Boasson et al. showed that the problem is linear in n and designed an
algorithm which is exponential in m and linear in n. All algorithms mentioned
so far either do no preprocessing or preprocess the pattern.

The presented approach is based on preprocessing the text. We build the
Episode Directed Acyclic Subsequence Graph (EDASG), which allows an episode
substring to be found in O(m) time. Building the EDASG requires $O(n\sigma)$
time. Let $T = t_1 t_2 \ldots t_n$ and $S = s_1 s_2 \ldots s_m$.

First, we briefly recall the Directed Acyclic Subsequence Graph (DASG),
which is used in the subsequence matching problem. Given a text T, the DASG
is a finite automaton that accepts all subsequences of T. A finite automaton
is, in this paper, a 5-tuple $(Q, \Sigma, \delta, q_0, F)$, where Q is a finite
set of states, $\Sigma$ is an input alphabet, $\delta$ is a transition
function, $q_0$ is the initial state, and $F \subseteq Q$ is the set of final
states.

* This research has been supported by GACR grant No. 201/01/1433. 

A. Amir and G.M. Landau (Eds.): CPM 2001, LNCS 2089, pp. 2001. 

@ Springer-Verlag Berlin Heidelberg 2001 
Let $Q = \{q_0, q_1, \ldots, q_n\}$ and F = Q. For each $a \in \Sigma$ and
$i \in \langle 0, n \rangle$ we define the transition function $\delta$ as
follows:

$\delta(q_i, a) = q_j$ if there exists k > i such that $a = t_k$ and j is the
minimal such k,

$\delta(q_i, a) = \emptyset$ otherwise.

Then the automaton $A = (Q, \Sigma, \delta, q_0, F)$ accepts a pattern S if and
only if S is a subsequence of T. Two algorithms, right-to-left and
left-to-right, are known for building the DASG in $O(n\sigma)$ time. The DASG
for T allows checking whether S is a subsequence of T in O(m) time. We note
that each state $q \in Q$ corresponds to a prefix of T such that this prefix is
the longest path from $q_0$ to q.
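The transition function can be tabulated by one scan of the text from right to left. The sketch below is a minimal illustration under the assumption that states are represented by the integers 0..n and that a missing dictionary entry plays the role of the empty transition; the function names are hypothetical.

```python
def build_dasg(text):
    """delta[i][a] is the smallest k > i with t_k = a (1-based k), if any."""
    n = len(text)
    delta = [dict() for _ in range(n + 1)]
    nearest = {}                       # nearest occurrence of each symbol so far
    for i in range(n, 0, -1):
        nearest[text[i - 1]] = i       # t_i is now the first occurrence after i - 1
        delta[i - 1] = dict(nearest)
    return delta

def is_subsequence(delta, pattern):
    """Run the DASG from q_0; every state is final, so only a missing
    transition rejects the pattern."""
    state = 0
    for c in pattern:
        state = delta[state].get(c)
        if state is None:
            return False
    return True

delta = build_dasg("abba")
print(delta[0])                        # {'a': 1, 'b': 2}
print(is_subsequence(delta, "aba"))    # True
print(is_subsequence(delta, "baa"))    # False
```

Copying the dictionary at every position costs time proportional to the alphabet size, which gives the $O(n\sigma)$ construction time mentioned above.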

Lemma 1. Let $u, v, x, y \in \Sigma^*$ such that u is a subsequence of v and x
is a subsequence of y. Then ux is a subsequence of vy.

Proof. The set of all subsequences of a string $u_1 u_2 \ldots u_l$ can be
described by the regular expression
$(\varepsilon + u_1)(\varepsilon + u_2) \ldots (\varepsilon + u_l)$. The lemma
directly follows. □

Lemma 2. Given $i \in \langle 0, n \rangle$, the automaton
$A_i = (Q, \Sigma, \delta, q_i, F)$ accepts S iff S is a subsequence of
$t_{i+1} \ldots t_n$.

Proof. Let $T_i = t_{i+1} \ldots t_n$. We prove two implications:

1. If $A_i$ accepts S then S is a subsequence of $T_i$. $A_i$ accepts S, thus A
accepts $t_1 \ldots t_i s_1 \ldots s_m$. Since $t_1 \ldots t_i s_1 \ldots s_m$
is a subsequence of $t_1 \ldots t_n$, we get that $s_1 \ldots s_m$ is a
subsequence of $t_{i+1} \ldots t_n$.

2. If S is a subsequence of $T_i$ then $A_i$ accepts S. S is a subsequence of
$T_i$, hence A accepts $t_1 \ldots t_i s_1 \ldots s_m$. When $t_1 \ldots t_i$
has been read, the automaton A is in state $q_i$ and consequently $A_i$ accepts
$s_1 \ldots s_m$. □



Fig. 1. The EDASG for text T = abba.



Lemma 3. Let $i \in \langle 0, n \rangle$ and let $q_j$ be the active state of
the automaton $A_i = (Q, \Sigma, \delta, q_i, F)$ after reading through S, i.e.
$\delta^*(q_i, s_1 \ldots s_m) = q_j$. Then j is the minimal k such that
$t_{i+1} \ldots t_k$ contains S as a subsequence, i.e.
$j = \min\{k : s_1 \ldots s_m$ is a subsequence of $t_{i+1} \ldots t_k\}$.

Proof (by contradiction). Suppose that there exists k < j such that
$s_1 \ldots s_m$ is a subsequence of $t_{i+1} \ldots t_k$. Then
$s_1 \ldots s_m t_{k+1} \ldots t_n$ is a subsequence of $t_{i+1} \ldots t_n$ and
therefore it should be accepted by $A_i$. We note that $A_i$ is in $q_j$ after
reading through S. But $A_j$ accepts strings of length at most n - j, which
contradicts the hypothesis that k < j. □

Let $A^R = (Q^R, \Sigma, \delta^R, q_n, F^R)$ be the DASG for the reversed text
$T^R = t_n \ldots t_1$, i.e. $Q^R = \{q_n, \ldots, q_0\}$, $F^R = Q^R$ and

$\delta^R(q_i, a) = q_j$ if there exists k < i such that $a = t_{k+1}$ and j is
the maximal such k,

$\delta^R(q_i, a) = \emptyset$ otherwise.

Lemma 3 says that the active state $q_j$ of A after reading through S
determines the end of the episode substring. Its beginning can be found using
the DASG $A^R$. Hence, to find an episode substring we need two DASGs: one for
the text T and one for the reversed text $T^R$. Since $Q = Q^R$ we can let both
automata A and $A^R$ share the set of states and combine them into the Episode
Directed Acyclic Subsequence Graph. Formally we can say that the EDASG is a
6-tuple $(Q, \Sigma, \delta, \delta^R, q_0, F)$. One can also note that we
extend the automaton A with the transitions of the automaton $A^R$. An example
of the EDASG is in Fig. 1.



procedure build_edasg(T)
input: text $T = t_1 t_2 \ldots t_n \in \Sigma^*$
output: the EDASG for text $T$

for all $a \in \Sigma$ do
    $f[a] \leftarrow 0$
    $l[a] \leftarrow \bot$
end for
create state $q_0$ and mark it as final
for $i = 1$ to $n$ do
    add state $q_i$ and mark it as final
    for $j = f[t_i]$ to $(i - 1)$ do
        add $\delta$ transition labeled $t_i$ between states $q_j$ and $q_i$
    end for
    $f[t_i] \leftarrow i$
    $l[t_i] \leftarrow i - 1$
    for all $a \in \Sigma$, $l[a] \ne \bot$ do
        add $\delta^R$ transition labeled $a$ between states $q_i$ and $q_{l[a]}$
    end for
end for


Algorithm 1: Building the EDASG. 



Lemma 4. Let $q_j$ be the active state of the automaton $A = (Q, \Sigma, \delta, q_0, F)$ after reading through $S$, and $q_i$ be the active state of the automaton $A^R = (Q^R, \Sigma, \delta^R, q_j, F^R)$ after reading through $S^R = s_m \ldots s_1$. Then $t_{i+1} \ldots t_j$ is a minimal substring of $T$ that contains $S$ as a subsequence.

Proof. We use Lemma 3 for $A$ and $A^R$. □

The algorithm for building the EDASG is a combination of right-to-left and 
left-to-right algorithms for building the DASG. 

Once the EDASG has been built, we can find an episode substring in two steps: (1) we find the end of the substring (the EDASG simulates the DASG for $T$: it reads $S$ and uses transitions $\delta$), (2) we find the beginning of the substring (the EDASG simulates the DASG for $T^R$: it reads $S^R$ and uses transitions $\delta^R$). The substring is determined by the active states after the first and second step. If $q_i$ is the active state after the second step, we can start to look for the next episode substring at state $q_{i+1}$.
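The construction of Algorithm 1 and the two-step search can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the names `build_edasg`, `find_episode` and `find_all_episodes` are chosen here, states $q_0, \ldots, q_n$ are represented by the integers 0..n, and an undefined $l[a]$ is modelled by a missing dictionary key.

```python
def build_edasg(text):
    """Return (delta, delta_r): forward and backward transition tables."""
    n = len(text)
    delta = [dict() for _ in range(n + 1)]    # transitions of the DASG for T
    delta_r = [dict() for _ in range(n + 1)]  # transitions of the DASG for reversed T
    f = {}  # f[a]: index of the latest occurrence of a (0 if none yet)
    l = {}  # l[a]: target of the backward transition on a (absent = undefined)
    for i in range(1, n + 1):
        a = text[i - 1]
        for j in range(f.get(a, 0), i):
            delta[j][a] = i              # next occurrence of a after position j
        f[a] = i
        l[a] = i - 1
        for b, target in l.items():
            delta_r[i][b] = target       # state just before the latest b at or before i
    return delta, delta_r


def find_episode(delta, delta_r, episode, start=0):
    """Return (i, j) such that text[i:j] is an episode substring, or None."""
    state = start
    for a in episode:                    # step 1: find the end with delta
        state = delta[state].get(a)
        if state is None:
            return None
    end = state
    for a in reversed(episode):          # step 2: find the beginning with delta_r
        state = delta_r[state][a]
    return state, end


def find_all_episodes(text, episode):
    delta, delta_r = build_edasg(text)
    found, start = [], 0
    while start <= len(text):
        hit = find_episode(delta, delta_r, episode, start)
        if hit is None:
            break
        found.append(hit)
        start = hit[0] + 1               # continue from state q_{i+1}
    return found
```

For the text abba of Fig. 1 and the episode ba, the search reports the pair (2, 4), i.e. the substring $t_3 t_4$.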

For the analysis of the algorithm we assume that the alphabet $\Sigma$ has no useless symbols, i.e. all symbols of $\Sigma$ occur in $T$. Since each EDASG consists of two ordinary DASGs, the numbers of its $\delta$ and $\delta^R$ transitions are both $O(n\sigma)$. Thus, building the EDASG requires $O(n\sigma)$ time and $O(\sigma)$ extra space.

Finding an episode substring needs $O(m)$ time. Since there can be $O(n)$ episode substrings, we need $O(nm)$ time to find all episode substrings in the worst case. But once we have preprocessed the text, finding an episode substring is extremely fast, which is the major advantage of the presented approach.

If the alphabet is not known in advance, we use a balanced tree for the values $f[a]$ and $l[a]$ ($a \in \Sigma$). The time complexities mentioned above are then multiplied by a factor of $\log \sigma$.
1 Introduction

Identification of similar objects from a large collection of objects is a fundamental technique in several different areas of computer science, e.g., case-based reasoning and machine discovery. Strings are the most basic representations of objects inside computers, and thus string similarity is one of the most important topics in computer science.

A similarity measure must be sensitive to the kind of differences we wish to
quantify. The weighted edit distance is one such framework in which the measure
can be varied by altering the weight assigned to each edit operation depending
on symbols involved. However, it does not suffice to solve ‘real problems’ (see 
e.g., I). It is considered that two objects have necessarily a common structure 
if they seem similar, and the degree of similarity depends upon how valuable the 
common structure is. Based on this intuition, we present a unifying framework, 
named string resemblance system (SRS, for short). In this framework, similarity 
of two strings can be viewed as the maximum score of pattern that matches 
both of them. The differences among the measures are therefore the choices of 
(1) pattern set to which common patterns belong, and (2) pattern score function 
which assigns a score to each pattern. 

For example, if we choose the set of patterns with variable length don’t cares 
and define the score of a pattern to be the number of symbols in it, then the 
obtained measure is the length of the longest common subsequence (LCS) of two 
strings. In fact, the strings acdeba and abdac have a common pattern a*d*a* 
which contains three symbols. With this framework one can easily design and 
modify his/her measures. In this paper we briefly describe SRSs and then report 
successful results of applications to literature and music. 

2 Unifying Framework for String Similarity 

In practical applications such as biological sequence comparisons, it is often pre- 
ferred to measure similarity rather than distance between two given strings. We 
shall regard a distance measure as a similarity measure by multiplying the dis- 
tance values by —1. Gusfield ^ pointed out that in dealing with string similarity 
the language of alignments is often more convenient than the language of edit operations. Here, we generalize the alignment-based scheme and propose a new scheme which is based on the notion of common patterns. Before describing our scheme, we need to introduce some notation. The set of strings over an alphabet $\Sigma$ is denoted by $\Sigma^*$. The length of a string $u$ is denoted by $|u|$. The string of length 0 is called the empty string, and denoted by $\varepsilon$. Let $\Sigma^+ = \Sigma^* - \{\varepsilon\}$. Let us denote by $\mathbf{R}$ the set of real numbers.

Definition 1. A string resemblance system (SRS) is a 4-tuple $(\Sigma, \Pi, L, \Phi)$, where:

1. $\Sigma$ is an alphabet;

2. $\Pi$ is a set of descriptions called patterns;

3. $L$ is a function called interpretation that maps a pattern in $\Pi$ to a language over $\Sigma$, i.e., a subset of $\Sigma^*$;

4. $\Phi$ is a function that maps a pattern in $\Pi$ to a real number called score.

The similarity between strings $x$ and $y$ with respect to $(\Sigma, \Pi, L, \Phi)$ is defined by

$$SIM(x, y) = \sup\{\Phi(\pi) \mid \pi \in \Pi \text{ and } x, y \in L(\pi)\}.$$

We assume that, for any $x, y \in \Sigma^*$, the set $\{\Phi(\pi) \mid \pi \in \Pi \text{ and } x, y \in L(\pi)\}$ is non-empty, bounded above, and contains its least upper bound as a member. This assumption guarantees that for any $x, y \in \Sigma^*$ there always exists a pattern $\pi \in \Pi$ common to $x$ and $y$ that maximizes the score $\Phi(\pi)$. Thus, computation of similarity is regarded as optimal pattern discovery in our framework. In this sense our framework bridges a gap between similarity computation and pattern discovery.

Definition 2. An SRS $(\Sigma, \Pi, L, \Phi)$ is said to be homomorphic if

1. $\Pi = (\Sigma \cup \Delta)^*$, where $\Delta$ is a set of wildcards.

2. $L : \Pi \to 2^{\Sigma^*}$ is a homomorphism such that $L(c) = \{c\}$ for any $c \in \Sigma$ and $L(\pi_1 \pi_2) = L(\pi_1) L(\pi_2)$ for any $\pi_1, \pi_2 \in \Pi$.

3. $\Phi : \Pi \to \mathbf{R}$ is a homomorphism such that $\Phi(\pi_1 \pi_2) = \Phi(\pi_1) + \Phi(\pi_2)$ for any $\pi_1, \pi_2 \in \Pi$.

Note that when $\Sigma$ is fixed, a homomorphic SRS is determined by specifying (1) the set $\Delta$ of wildcards, (2) the values $L(\gamma)$ for all $\gamma \in \Delta$, and (3) the values $\Phi(\gamma)$ for all $\gamma \in \Sigma \cup \Delta$.

The class of homomorphic SRSs covers most of the known similarity (dissimilarity) measures. For example, the edit distance falls into this class. Let $\Delta = \{\phi\}$ where $\phi$ is the wildcard that matches the empty string and any symbol in $\Sigma$, namely, $L(\phi) = \Sigma \cup \{\varepsilon\}$. Let $\Phi(\phi) = -1$ and $\Phi(c) = 0$ for all $c \in \Sigma$. Then, the similarity measure defined by this homomorphic SRS is the same as the edit distance except that the values are non-positive. Similarly, the Hamming distance can be defined by using the wildcard $\phi$ that matches any symbol in $\Sigma$.

We can define the LCS measure by using the wildcard $\star$ that matches any string in $\Sigma^*$. Namely, the homomorphic SRS such that (1) $\Delta = \{\star\}$, (2) $L(\star) = \Sigma^*$, and (3) $\Phi(\star) = 0$ and $\Phi(c) = 1$ for any $c \in \Sigma$ gives the LCS measure. Another definition of this measure is possible, using the wildcard $\phi$ with $L(\phi) = \Sigma \cup \{\varepsilon\}$, but the common patterns obtained here are much simpler.
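For these two homomorphic SRSs the value $SIM(x, y)$ need not be found by enumerating patterns: it coincides with the length of the LCS and with the negated unit-cost edit distance, respectively, both of which standard dynamic programming computes. The following Python sketch illustrates this; the function names are chosen here for illustration only.

```python
def lcs_similarity(x, y):
    """SIM(x, y) for the SRS with wildcard * (score 1 per symbol, 0 per *)."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def edit_similarity(x, y):
    """SIM(x, y) for the SRS with wildcard phi (score -1 per phi, 0 per symbol)."""
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, 1):
        cur = [i]
        for j, b in enumerate(y, 1):
            cur.append(min(prev[j] + 1,              # phi matches a in x, empty in y
                           cur[j - 1] + 1,           # phi matches empty in x, b in y
                           prev[j - 1] + (a != b)))  # same symbol, or phi for a mismatch
        prev = cur
    return -prev[-1]
```

For the strings acdeba and abdac used in the introduction, `lcs_similarity` returns 3, the number of symbols in the common pattern a⋆d⋆a⋆.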

The weighted edit distance can also be defined as a homomorphic SRS in which the wildcards $\phi(a|b)$ ($a, b \in \Sigma \cup \{\varepsilon\}$ and $a \ne b$) such that $L(\phi(a|b)) = \{a, b\}$ are introduced, and $\Phi(\phi(a|b))$ is the weight assigned to each pair of $a$ and $b$.

Next, we extend the class of homomorphic SRSs by easing the restriction on the pattern score functions as follows. A pattern score function $\Phi$ defined on $\Pi = (\Sigma \cup \Delta)^*$ is said to be semi-homomorphic if there exists a subset $\mathcal{D}$ of $\Pi$ with $\varepsilon \notin \mathcal{D}$ and $\Pi = \mathcal{D}^*$, and a function $g : \mathcal{D} \to \mathbf{R}$ such that, for any $\pi \in \Pi$,

$$\Phi(\pi) = \max\Bigl\{ \sum_{i=1}^{l} g(\pi_i) \Bigm| l > 0,\ \pi_i \in \mathcal{D}\ (i = 1, \ldots, l),\ \text{and}\ \pi = \pi_1 \cdots \pi_l \Bigr\}.$$

Definition 3. An SRS $(\Sigma, \Pi, L, \Phi)$ is said to be semi-homomorphic if

1. $\Pi = (\Sigma \cup \Delta)^*$, where $\Delta$ is a set of wildcards.

2. $L : \Pi \to 2^{\Sigma^*}$ is a homomorphism such that $L(c) = \{c\}$ for any $c \in \Sigma$ and $L(\pi_1 \pi_2) = L(\pi_1) L(\pi_2)$ for any $\pi_1, \pi_2 \in \Pi$.

3. $\Phi : \Pi \to \mathbf{R}$ is semi-homomorphic.

Computation of the weighted edit distance between two given strings $x$ and $y$ can be viewed as computation of the lowest scoring paths from node $(0, 0)$ to node $(|x|, |y|)$ in the weighted edit graph, a directed (acyclic) weighted graph where the vertices are the $(|x|+1) \times (|y|+1)$ points of the grid with rows $0, \ldots, |x|$ and columns $0, \ldots, |y|$. The computation can be done by standard dynamic programming in $O(|x||y|)$ time.

A similar discussion is possible for (semi-)homomorphic SRSs, with appropriate modifications in the definition of the weighted edit graph. The construction time of such a graph depends upon the response time for the membership query "$w \in L(\gamma)$" for a wildcard $\gamma$ in $\Delta$ and upon that of $\Phi$. However, once such a graph is constructed, the best score can be computed in linear time with respect to the number of edges in the graph, which varies depending upon $\Delta$ (and upon $\mathcal{D}$ in the case of semi-homomorphic SRSs). It would be interesting to reveal the hierarchy of subclasses of SRSs from the viewpoint of computational complexity, but this is beyond the scope of the present paper.

As demonstrated so far, we can handle a variety of string (dis)similarity measures by changing the pattern set $\Pi$ and the pattern score function $\Phi$. The pattern sets discussed above are, however, restricted to the form $\Pi = (\Sigma \cup \Delta)^*$, where $\Delta$ is a set of wildcards. Here we shall mention pattern sets of other types. An order-free pattern is a multiset $\{u_1, \ldots, u_k\}$ such that $k > 0$ and $u_1, \ldots, u_k \in \Sigma^+$, and is denoted by $\pi[u_1, \ldots, u_k]$. The language of the pattern $\pi[u_1, \ldots, u_k]$ is defined to be the union of the languages $\Sigma^* u_{\sigma(1)} \Sigma^* \cdots \Sigma^* u_{\sigma(k)} \Sigma^*$ over all permutations $\sigma$ of $\{1, \ldots, k\}$. For example, the language of the pattern $\pi[abc, de]$ is $\Sigma^* abc \Sigma^* de \Sigma^* \cup \Sigma^* de \Sigma^* abc \Sigma^*$. The membership problem for order-free patterns is NP-complete, and therefore the similarity computation is generally impractical. However, the problem is polynomial-time solvable when $k$ is fixed.
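The membership test for an order-free pattern can be sketched directly from the definition by trying every permutation of the phrases, which is polynomial in the text length only when $k$ is fixed. The Python below is an illustrative brute-force sketch, with function names chosen here; it is not a method taken from the paper.

```python
from itertools import permutations


def matches_in_order(text, phrases):
    """Test membership in Sigma* u_1 Sigma* ... Sigma* u_k Sigma* greedily."""
    pos = 0
    for u in phrases:
        pos = text.find(u, pos)   # leftmost occurrence at or after pos
        if pos < 0:
            return False
        pos += len(u)             # occurrences must not overlap
    return True


def order_free_match(text, phrases):
    """Test whether text belongs to the language of the order-free pattern."""
    return any(matches_in_order(text, p) for p in permutations(phrases))
```

Taking the leftmost occurrence of each phrase is sufficient for a fixed order, so the cost is dominated by the $k!$ orderings.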
The pattern languages introduced by Angluin are also interesting for our framework. A pattern is a string in $\Pi = (\Sigma \cup V)^+$, where $V$ is an infinite set $\{x_1, x_2, \ldots\}$ of variables and $\Sigma \cap V = \emptyset$. The language of a pattern $\pi$ is the set of strings obtained by replacing variables in $\pi$ by non-empty strings. The membership problem is NP-complete for the class of patterns, but it is polynomial-time solvable when the number of variables occurring more than once within $\pi$ is bounded by a fixed number $k$.



3 Discovery from Literary Works 

Waka is a form of traditional Japanese poetry with a 1300-year history. A Waka 
poem has five lines and thirty-one syllables, arranged thus: 5-7-5-7-7. In Q we 
attempted to semi-automatically discover similar poems from an accumulation 
of about 450,000 Waka poems in a machine-readable form. One reasonable approach is to arrange all possible pairs of poems in decreasing order of their similarity, and to scrutinize the first part of this list from a scholarly viewpoint. One of the aims here is to discover unheeded instances of Honkadori (poetic allusion), an important rhetorical device in Waka poems based on specific allusion to earlier famous poems.

We tested three similarity measures for dealing with similarity between Waka poems, which were newly designed within our framework. The first measure is based on line-order alternation and on the modified LCS measure for quantifying affinity between lines, which is defined as a semi-homomorphic SRS such that $\Delta = \{\star\}$, $L(\star) = \Sigma^*$, and the pattern score function is defined by $\mathcal{D} = \Sigma^+ \cup \{\star\}$ and $g(\pi) = |\pi| - s$ if $\pi \in \Sigma^+$; otherwise, $g(\pi) = 0$, where $s$ ($0 < s < 1$) is a penalty for a break in the continuity of symbols. This measure proved suitable for finding instances of poetic allusion.
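One way to read the modified LCS measure is that every maximal run of matched symbols in a common pattern scores its length minus the penalty $s$, so the similarity of two lines is the best total over ordered, non-overlapping common blocks. The Python sketch below computes this value by dynamic programming under that reading; it covers only the affinity between two lines, not the line-order alternation, and the function name and the default $s = 0.5$ are assumptions made here.

```python
def block_lcs_similarity(x, y, s=0.5):
    """Best score of a common pattern over Sigma and *, with penalty s per block."""
    neg_inf = float("-inf")
    m, n = len(x), len(y)
    # ends[i][j]: best score when x[i-1] and y[j-1] are matched as the end of a block
    ends = [[neg_inf] * (n + 1) for _ in range(m + 1)]
    # best[i][j]: best score of any common pattern of x[:i] and y[:j] followed by *
    best = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                extend = ends[i - 1][j - 1]      # continue the current block
                open_new = best[i - 1][j - 1] - s  # start a new block, pay s once
                ends[i][j] = 1 + max(extend, open_new)
            best[i][j] = max(best[i - 1][j], best[i][j - 1], ends[i][j])
    return best[m][n]
```

With $s = 0$ the value reduces to the ordinary LCS length, which matches the role of $s$ as a penalty for broken continuity.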

The second and the third measures are based on the order-free patterns 
defined in the previous section, in order to cope with word-order alternation. 
These two measures differ in the respect that the pattern score function of the 
third measure depends on the rarity of common pattern within a given large 
collection of poems, whereas that of the second one is defined syntactically. The 
idea of rarity is proved to be effective in identifying only close affinities which 
are hardly seen elsewhere, possibly excluding known stereotype expressions. 

The first measure is especially favored by Waka researchers and used in dis- 
covering affinities of some unheeded poems with some earlier ones. The discov- 
ered affinities raise an interesting issue for Waka studies: (1) We have proved that 
one of the most important poems by Fujiwara-no-Kanesuke, one of the renowned 
thirty-six poets, was in fact based on a model poem found in Kokin-Shu. The 
same poem had been interpreted just to show “frank utterance of parents’ care 
for their child.” Our study revealed the poet’s techniques in composition half hid- 
den by the heart-warming feature of the poem by extracting the same structure 
between the two poems. (2) We have compared Tametada-Shu, the mysterious 
anthology unidentified in Japanese literary history, with a number of private 
anthologies edited after the middle of the Kamakura period (the 13th-century) 
using the same method, and found that there are about 10 pairs of similar poems between Tametada-Shu and Sōkon-Shu, an anthology by Shōtetsu. The result suggests that the mysterious anthology was edited by a poet in the early Muromachi period (the 15th century). There has been a dispute about the editing date since one scholar suggested the middle of the Kamakura period as a probable one. Our result provides strong evidence on this problem.

4 Finding Affinities from Musical Scores 

Any monophonic score can be regarded as a string of ordered pairs consisting of 
the pitch of the note and its length. Mongeau and Sankoff J proposed a dissim- 
ilarity measure for monophonic scores, which is a variant of the weighted edit 
distance where additional two edit operations, fragmentation and consolidation, 
are allowed to associate multiple notes with a single note or vice versa. It is 
reported in Q that the measure arranges the variations on a theme by Mozart 
in a reasonable order which coincides with subjective impressions. However, it 
turned out from our experimental results that a problem arises when dealing 
with the mixtures of variations on several themes. 

In Q we tested three similarity measures and showed that the third one 
could cope with this problem. As preprocessing, each note in a musical score is replaced with a sequence of notes of a unit length (16th note) to obtain simply a string of pitches. The measures are respectively based on three measures to quantify the affinities between two phrases of uniform length, each of which falls into the class of the semi-homomorphic SRSs. The set $\Pi = (\Sigma \cup \{\phi\})^*$ is commonly used in the three. While the pattern score function of the first measure is the one which simply counts up matches (i.e., the number of symbols in a pattern), those of the second and third measures are sensitive to the continuity of matches. More precisely, in the second measure it is defined by $\mathcal{D} = \Sigma^+ \cup \{\phi\}$, and $g(\pi) = |\pi|$ if $\pi \in \Sigma^+$ and $|\pi| > s$; otherwise, $g(\pi) = 0$, where $s$ is a threshold. In the third measure, $\mathcal{D} = \{\pi \in (\Sigma \cup \{\phi\})^+ \mid \pi$ does not contain … $\}$, and $g(\pi)$ is the number of symbols within $\pi$ if $|\pi| > s$; otherwise, $g(\pi) = 0$, where $s, t$ are thresholds. Despite its simplicity, the third measure is better than Mongeau and Sankoff's in the sense that it is able to exclude variations on other themes.
Abstract. We describe an efficient implementation of a text mining algorithm for discovering a class of simple string patterns. With an index structure for pattern discovery, called the virtual suffix tree, built on top of the suffix array, the resulting algorithm is simple and fast in practice compared with the previous implementation with the suffix tree.



1 Introduction 

In this extended abstract, we investigate a text mining problem based on the framework of optimal pattern discovery. We give an efficient implementation of an existing text mining algorithm for finding $k$-proximity $d$-phrase patterns from a large collection of texts using a data structure called the virtual suffix tree. The virtual suffix tree is a space-efficient alternative to the suffix tree built on top of the suffix array, and can be used for bottom-up traversal of the suffix tree, allowing reconstruction of the index.

Our algorithm runs in almost linear time, with a poly-log factor, in the total length of a text collection in the average case for fixed $k, d > 0$. From the practical point of view, the algorithm is conceptually simpler, easier to implement, and more scalable on large text data than the previous implementation with the suffix tree, due to the use of the virtual suffix tree.

2 The Class of Patterns

We introduce the class of patterns to discover. Let $\Sigma$ be a constant alphabet of letters. A $k$-proximity $d$-phrase association pattern ($(k, d)$-proximity pattern) is a sequence of $d$ phrases $\pi = (\langle p_1, \ldots, p_d \rangle; k)$ with the bounded gap length $k$, called proximity. Each $p_i$ is a substring of a text, called a phrase. The pattern $\pi$ matches a text if the phrases $p_1, \ldots, p_d$ occur in the text in this order and the distance between the occurrences of each consecutive pair of phrases does not exceed $k$. For instance, $(\langle \mathtt{tata} \rangle, \langle \mathtt{cacag} \rangle, \langle \mathtt{caatcag} \rangle; 20)$ is an example of a $(20, 3)$-pattern.

$\mathcal{P}^k_d$ denotes the class of $(k, d)$-proximity patterns. The well-known followed-by patterns are the special case with $d = 2$. Another definition as gap patterns with bounded gap length is possible, but a slight change is needed.
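The matching condition can be illustrated by a small Python sketch. It assumes that the distance between two consecutive phrases is the number of characters between the end of one occurrence and the start of the next, ranging over $0, \ldots, k$; this reading, the function name `matches_proximity` and the sample text are assumptions made for illustration.

```python
def matches_proximity(text, phrases, k):
    """Test whether the phrases occur in order with at most k characters between them."""
    # candidate end positions of the phrases matched so far
    ends = {i + len(phrases[0])
            for i in range(len(text)) if text.startswith(phrases[0], i)}
    for p in phrases[1:]:
        next_ends = set()
        for end in ends:
            for gap in range(k + 1):
                if text.startswith(p, end + gap):
                    next_ends.add(end + gap + len(p))
        ends = next_ends
    return bool(ends)
```

Keeping a set of end positions instead of a single greedy match matters here, because the earliest occurrence of a phrase is not always the one from which the next phrase is reachable within $k$ characters.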
3 Framework of Text Mining

As the framework of data mining, we adopted optimal pattern discovery. A sample is a pair $(S, \xi)$ of a collection of texts $S = \{s_1, \ldots, s_m\} \subseteq \Sigma^*$ and a binary labeling $\xi : S \to \{0, 1\}$, called objective function, which may be a pre-defined or human-specified category.

Let $\mathcal{P}$ be a class of patterns and $\psi : [0, 1] \to \mathbf{R}$ be a symmetric, concave, real-valued function called impurity function to measure the skewness of the class distribution after the classification by a pattern. The classification error $\psi(p) = \min(p, 1 - p)$ and the information entropy $\psi(p) = -p \log p - (1 - p) \log(1 - p)$ are examples of $\psi$. We identify each $\pi \in \mathcal{P}$ with a classification function (matching function) $\pi : \Sigma^* \to \{0, 1\}$ as usual.

Given a sample $(S, \xi)$, a pattern discovery algorithm tries to find a pattern $\pi$ in the hypothesis space $\mathcal{P}$ that optimizes the evaluation function

$$G^{\psi}_{S,\xi}(\pi) = \psi(M_1/N_1) N_1 + \psi(M_0/N_0) N_0,$$

where the tuple $(M_1, M_0, N_1, N_0)$ is the contingency table determined by $\pi$ and $\xi$ over $S$, namely, $M_\alpha = \sum_{x \in S} [\pi(x) = \alpha \wedge \xi(x) = 1]$ and $N_\alpha = \sum_{x \in S} [\pi(x) = \alpha]$, where $\alpha \in \{0, 1\}$ and $[Pred] \in \{0, 1\}$ is the indicator function. Now, we state our data mining problem.

Optimal Pattern Discovery with $\psi$

Given: a set $S$ of texts and an objective function $\xi : S \to \{0, 1\}$.

Problem: Find a pattern $\pi \in \mathcal{P}$ that minimizes the cost $G^{\psi}_{S,\xi}(\pi)$ within $\mathcal{P}$.

Why is it good to minimize the cost $G^{\psi}_{S,\xi}(\pi)$? For the case of the classification error $\psi$ used in machine learning, it is known that any algorithm that efficiently solves the above optimization problem can approximate an arbitrary unknown probability distribution and thus can work in noisy environments.
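The evaluation function can be made concrete with a short Python sketch. The representation of a sample as a list of (text, label) pairs, the treatment of an empty class ($N_\alpha = 0$) as contributing zero cost, the base-2 logarithm, and all function names are assumptions made here for illustration.

```python
import math


def classification_error(p):
    return min(p, 1 - p)


def entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def cost(sample, matches, impurity):
    """Evaluate G(pi) for a matching function over a labeled sample."""
    m = [0, 0]  # m[a]: texts with pi(x) = a and label 1
    n = [0, 0]  # n[a]: texts with pi(x) = a
    for text, label in sample:
        a = 1 if matches(text) else 0
        n[a] += 1
        m[a] += label
    # a side of the split that contains no text contributes nothing
    return sum(impurity(m[a] / n[a]) * n[a] for a in (0, 1) if n[a] > 0)
```

A pattern that separates the two labels perfectly has cost 0 under both impurity functions, and with the classification error the cost equals the number of misclassified texts under the better of the two label assignments on each side.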



4 Previous Algorithms with Suffix Tree

In this poster, we consider only the case of fixed $d$, the maximum number of phrases. Otherwise, the problem becomes hard to approximate (MAXSNP-hard) for the class $\bigcup_{k,d} \mathcal{P}^k_d$. For fixed $d$, the optimal pattern discovery problem for $\mathcal{P}^k_d$ is solvable by a naive generate-and-test algorithm. Although its running time can be reduced to $O(n^{d+1})$ with the suffix tree, it is still too slow to apply to real-world problems.

In previous work, we developed an efficient algorithm, called Split-Merge-with-Tree (SMT), for $\mathcal{P}^k_d$. With the suffix tree, SMT uses the known correspondence between patterns in $\mathcal{P}^k_d$ and the axis-parallel rectangles in the $d$-dimensional rank space $\{1, \ldots, n\}^d$ on a suffix array. SMT searches for a best rectangle by combining the $d$-dimensional orthogonal range tree and the suffix tree. SMT is fast in theory, but slow in practice due to the huge space requirement and complicated implementation.
5 Reconstructible Suffix Dictionary

To overcome the problem, we implemented the idea used in the SMT algorithm using a new indexing data structure called the virtual suffix tree, described below.

Let $T$ be a text of length $n$ and $P = \{1, \ldots, n\}$ be the set of all positions (or index points) where occurrences of patterns are possible. We identify a suffix with the index point at which it starts, and also identify a text $T$ with the set of all of its suffixes. Let $Q \subseteq T$ be any subset of the suffixes and $p$ be any branching substring of $T$. We say that $p$ belongs to $Q$ if $p$ is a prefix of some suffix in $Q$, and define the info of $p$ to be any information from which we can retrieve $|p|$ in $O(1)$ time and can enumerate all $K$ occurrences of $p$ in $T$ in $O(K)$ time.

The essence of the previous algorithm SMT is summarized by an abstract data type called a reconstructible suffix dictionary $D(T)$ defined with the following operations.

Reconstructible suffix dictionary

— Given a text $T$ of length $n$, create the dictionary $D(T)$ for all suffixes of $T$ in $O(n \log n)$ time (Create). Note that we have identified $T$ and the set of its positions.

— Given a dictionary $D(Q)$ and a subset $Q' \subseteq Q$ of index points, reconstruct the dictionary $D(Q')$ for all suffixes in $Q'$ in $O(m \log m)$ time, where $m = |Q'|$ is the cardinality of $Q'$ (Reconstruct).

— Given $D(Q)$, enumerate the info's of all branching substrings in $Q$ in $O(m)$ time, where $m = |Q|$ (Traverse).

— Given $D(Q)$ and a string $p$, detect all the occurrences (or the info) of $p$ in $Q$ in $O(|p| \log m)$ time if they exist, where $m = |Q|$ (Search).

We implemented the above abstract data type by a data structure called a virtual suffix tree, which is just the suffix array $SA$ coupled with the height array $Hgt$, the inverse array $Rank$ of $SA$, and the range minima query for constant-time lcp information. Here, $Hgt$ is the array that stores the lengths of the longest common prefixes of adjacent suffixes in $SA$. Although a reconstructible suffix dictionary can be implemented by the suffix tree combined with a marking technique, the virtual suffix tree has advantages over the suffix tree when simplicity, space-efficiency and quick reconstruction (restriction) are important.

For the Traverse operation, we developed a linear time algorithm that simulates bottom-up traversal of the suffix tree when $SA$ and $Hgt$ are given, while the well-known simulation of the suffix tree by binary search takes $O(n^2)$ time and $O(n \log n)$ time without and with lcp information, respectively. We also developed a linear time algorithm for building the array $Hgt$ from $SA$ and $T$. For the Reconstruct operation, we use the inverse array $Rank$ and the sorting of ranks for the update of the $Pos$ array, and also use constant-time lcp information for the update of the $Hgt$ array, within the claimed worst-case time complexity.
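The arrays involved can be illustrated with a small Python sketch. It builds $SA$ by plain sorting (a simplification that is not linear or $O(n \log n)$ in the worst case), derives $Rank$ and $Hgt$ with the well-known linear-time scan usually attributed to Kasai et al., and enumerates the branching substrings by a stack-based bottom-up pass over $Hgt$. Positions are 0-based, and the function names are chosen here; the authors' implementation may differ in detail.

```python
def suffix_array(text):
    """Suffix array by direct sorting; adequate for small illustrative inputs."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def rank_and_height(text, sa):
    """Rank is the inverse of SA; hgt[r] = lcp of the suffixes sa[r-1] and sa[r]."""
    n = len(text)
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r
    hgt = [0] * n
    h = 0
    for i in range(n):                       # process suffixes in text order
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]                  # predecessor in suffix-array order
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        hgt[rank[i]] = h
        if h > 0:
            h -= 1                           # the next suffix shares at least h-1
    return rank, hgt


def branching_intervals(hgt):
    """Yield (length, left, right): suffixes sa[left..right] share a branching prefix."""
    n = len(hgt)
    stack = [(0, 0)]                         # (lcp value, left boundary)
    result = []
    for i in range(1, n + 1):
        h = hgt[i] if i < n else 0           # final 0 flushes the stack
        left = i - 1
        while stack[-1][0] > h:
            length, left = stack.pop()
            result.append((length, left, i - 1))
        if stack[-1][0] < h:
            stack.append((h, left))
    return result
```

For the text banana this reports the three branching substrings ana, a and na, each with its length and its range of ranks, which is the kind of info the Traverse operation is meant to deliver.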
Algorithm Split-Merge-with-Array (SMA):

— Given: Integers $k, d > 0$ and text $T$.

— Initialize $R = \{1, \ldots, n\}$ and create the virtual suffix tree $D(T)$ for text $T$. Invoke $Find(R, d, k, \varepsilon)$ below and report the patterns with the minimal cost.

Procedure $Find(Q, d, k, \pi)$:

— Given: A set $Q$ of index points, integers $d, k \ge 0$, and a sequence $\pi$ of phrases.

— If $d = 0$, then compute the cost $G^{\psi}_{S,\xi}(\pi)$ from $Q$, which corresponds to the set of the occurrences of $\pi$ in $T$. Record the pattern $(\pi; k)$ with its cost.

— Otherwise, $d > 0$. Reconstruct $D(Q)$ from $D(T)$. Then, traverse the info's of all branching substrings $p$. For each $p$ in $D(Q)$, do the following:

(i) Let $O$ be the set of all occurrences of $p$ in $Q$. Shift each position in $O$ at most $k$ positions to the right by adding the skip $|p| + g$ for every $0 \le g \le k$. Let $P$ be the resulting set of ranks.

(ii) Invoke $Find(P, d - 1, k, \pi \cdot p)$ recursively.



Fig. 1. The algorithm for finding proximity phrase patterns with suffix arrays.



6 Fast Text Mining Algorithm with Suffix Array

Let $T = s_1 \$_1 \cdots s_m \$_m$ be a text obtained by concatenating all texts in $S = \{s_1, \ldots, s_m\}$ delimited with unique markers $\$_i$. Here, the document id $i$ and the label $\xi(s_i)$ are attached to each position. Fig. 1 shows our mining algorithm Split-Merge-with-Array (SMA). By an analysis similar to that in earlier work, we have the following theorem. Details of the proof will appear elsewhere.

Theorem 1. For all integers $d, k$, the algorithm SMA, given a sample $(S, \xi)$, computes all $K$ $(k, d)$-proximity patterns that minimize the cost $G^{\psi}_{S,\xi}$ in expected time $O(k^{d-1} n (\log n)^{d} + K)$ and space $O(dn)$ under the assumption that texts in $S$ are randomly generated from a memoryless source.

By experiments on English texts (15.2MB) Q and Web pages Q, we observed 
that an implementation of SMA runs in several seconds (d = 1) to several min- 
utes (d = 4 and k = 8 words) with WS (Ultra SPARC II, 300MHz, 256MB, 
Solaris 2.6, gcc). This is about 10^ to 10^ times speed-up to the previous imple- 
mentation B of the algorithm SMT with the suffix tree and the range-tree. 
Abstract. Let $E$ be a regular expression the size of which is $s$. Mirkin's prebases and Antimirov's partial derivatives lead to the construction of the same automaton, called the equation automaton of $E$. The number of states in this automaton is less than or equal to the number of states in the position automaton. On the other hand, it can be computed by Antimirov's algorithm with an $O(s^5)$ time complexity, whereas there exist $O(s^2)$ implementations for the position automaton. We present an $O(s^2)$ space and time algorithm to compute the equation automaton. It is based on the notion of canonical derivative, which is related both to word and partial derivatives. This work is tightly connected to the pattern matching area since the aim is, given a regular expression, to produce a recognizer that is as small as possible with the best space and time complexity.



1 Introduction 

The conversion of a regular expression into an equivalent finite automaton has many applications, especially in pattern matching. The notion of word derivative of a regular expression and the related notions of continuation and of partial derivative are suitable tools to study this problem. Three fundamental results lead to the construction of three well-known automata. First, the set of the aci-dissimilar word derivatives of a regular expression is finite, which leads to the definition of the deterministic derivatives automaton. Secondly, the continuations w.r.t. a given symbol $a$ in a linear expression (i.e. the non-null derivatives w.r.t. $ua$, for all words $u$) are aci-similar, which yields a constructive interpretation of the position automaton (classically computed by the Glushkov and McNaughton-Yamada algorithms). And third, the set of the partial derivatives of a regular expression is finite, which leads to the construction of the equation automaton.

The notion of canonical derivative (or c-derivative) developed by the authors sheds light on the tight connection which exists between the set of partial derivatives of a regular expression and the set of word derivatives of its linearized version. This notion leads to the definition of the c-continuation automaton, whose main interest is that it yields the position automaton by an isomorphism and the equation automaton by a quotient. Let us point out that these theoretical results are not related to the work of Hromkovic et al.: firstly, the common follow sets automaton they define has more states than the position automaton, and secondly they do not use any derivative-like tool; by contrast, our results are based on a new algebraic tool: the c-derivatives.

Given a regular expression the size of which is $s$, the c-continuation automaton and its quotient can be computed with an $O(s^2)$ space and time complexity, which significantly improves the $O(s^5)$ complexity of Antimirov's algorithm, and takes up the challenge of computing the equation automaton, which is smaller than the position one, with the same complexity as the most efficient implementations of the position automaton. The set of states is deduced from a preprocessing of the starred subexpressions and from an implicit computing of the c-continuations. The computation of the set of transitions is based on the specific structure of the c-derivatives and on their connection to position sets. Let us notice that the techniques we use to handle c-continuations are necessarily different from the procedures used by Hagenah and Muscholl to implement the common follow sets automaton. On the other hand, some refinements used in this paper to compute the set of transitions rely on the properties of the implicit structure we designed in the past to represent the position automaton; a very closely related structure is used in related work.

Section 2 recalls the classical constructions of the position automaton and 
of the equation automaton. Section 3 summarizes theoretical results concerning 
c-derivatives and c-continuations of a regular expression, and their relations with 
word derivatives and partial derivatives. The definition of the c-continuation au- 
tomaton is recalled, as well as the way it is connected to the position automaton 
and to the equation automaton. The new algorithm to build the equation au- 
tomaton is developed in Section 4 which deals with the construction of the set of 
states and in Section 5 which presents algorithmic refinements to compute the 
set of transitions. The new algorithm has been implemented in C; a
full example of its output is provided in Annex A.



2 Preliminaries 

We assume terminology and basic results concerning regular languages and finite automata are known and refer to classical books about these topics. We recall the classical constructions of the position automaton and of the equation automaton.

The size of an expression is the length of its suffixed form, and its (alphabetic) width is the number of symbol occurrences. Let $E$ be a regular expression of size $s$ and width $w$. Notice that $s$ and $w$ are linearly dependent as far as sequences of star operators and occurrences of the empty word and of the empty set are carefully handled. However, some computation steps depend on the size of the expression and other ones on the alphabetic width. This additional information may be helpful, for implementation purposes for instance, or for a deeper analysis of the number of operations. This is why we express the complexities w.r.t. both $s$ and $w$.
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2.1 The Position Automaton of a Regular Expression 

Let $E$ be a linear expression over $\Sigma$ and $\lambda(E)$ be defined by: $\lambda(E) = 1$ if $\varepsilon \in L(E)$ and $0$ otherwise. We consider the following sets of symbols:

- $\mathrm{First}(E)$, the set of symbols that match the first symbol of some word in $L(E)$.

- $\mathrm{Last}(E)$, the set of symbols that match the last symbol of some word in $L(E)$.

- $\mathrm{Follow}(E, x)$, for all $x$ in $\Sigma$: the set of symbols that follow the symbol $x$ in some word of $L(E)$.

If $E$ is a regular expression over $\Sigma$, and $\mathrm{Pos}_E$ the set of its positions, we consider its linearized version $\overline{E}$, whose symbols are the positions of $E$, and we set: $\mathrm{First}(E) = \mathrm{First}(\overline{E})$, $\mathrm{Last}(E) = \mathrm{Last}(\overline{E})$, $\mathrm{Follow}(E, x) = \mathrm{Follow}(\overline{E}, x)$, for all $x$ in $\mathrm{Pos}_E$. Let $h$ be the mapping from $\mathrm{Pos}_E$ to $\Sigma$ induced by the linearization of $E$ over $\mathrm{Pos}_E$. The position automaton $\mathcal{P}_E$ of $E$, whose states are the positions of $E$, and which recognizes $L(E)$, is defined as follows.



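Definition 1 (Position Automaton). The position automaton of $E$, $\mathcal{P}_E = (Q, \Sigma, i, T, \delta)$, is defined by: $Q = \mathrm{Pos}_E \cup \{0\}$, $i = \{0\}$, $T = \mathrm{Last}(E)$ if $\lambda(E) = 0$ and $T = \mathrm{Last}(E) \cup \{0\}$ otherwise, $\delta(0, a) = \{x \in \mathrm{First}(E) \mid h(x) = a\}$, $\forall a \in \Sigma$, and $\delta(x, a) = \{y \mid y \in \mathrm{Follow}(E, x) \text{ and } h(y) = a\}$, $\forall x \in \mathrm{Pos}_E$ and $\forall a \in \Sigma$.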



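Example 1. Let $E = x^*(xx + y)^*$ and $\overline{E} = x_1^*(x_2 x_3 + y_4)^*$. We have: $\mathrm{First}(E) = \{x_1, x_2, y_4\}$, $\mathrm{Last}(E) = \{x_1, x_3, y_4\}$, $\lambda(E) = 1$, $\mathrm{Follow}(E, x_1) = \{x_1, x_2, y_4\}$, $\mathrm{Follow}(E, x_2) = \{x_3\}$, $\mathrm{Follow}(E, x_3) = \mathrm{Follow}(E, y_4) = \{x_2, y_4\}$.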
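
The position sets can be computed by a bottom-up pass over the syntax tree. The following Python sketch is only an illustration: the tuple encoding of expressions and the helper names (`sym`, `cat`, `alt`, `star`, `nullable`, `first`, `last`, `follow`) are assumptions made for this example, and the constants 0 and 1 are left out for brevity. Applied to the linearized expression of Example 1, it yields the sets listed there.

```python
# Hypothetical tuple encoding of a linear expression (not taken from the paper):
# ("sym", x), ("cat", F, G), ("alt", F, G), ("star", F).
def sym(x): return ("sym", x)
def cat(f, g): return ("cat", f, g)
def alt(f, g): return ("alt", f, g)
def star(f): return ("star", f)

def nullable(e):
    """lambda(E): True iff the empty word belongs to L(E)."""
    tag = e[0]
    if tag == "sym":
        return False
    if tag == "star":
        return True
    if tag == "alt":
        return nullable(e[1]) or nullable(e[2])
    return nullable(e[1]) and nullable(e[2])  # cat

def first(e):
    tag = e[0]
    if tag == "sym":
        return {e[1]}
    if tag == "star":
        return first(e[1])
    if tag == "alt":
        return first(e[1]) | first(e[2])
    return (first(e[1]) | first(e[2])) if nullable(e[1]) else first(e[1])

def last(e):
    tag = e[0]
    if tag == "sym":
        return {e[1]}
    if tag == "star":
        return last(e[1])
    if tag == "alt":
        return last(e[1]) | last(e[2])
    return (last(e[1]) | last(e[2])) if nullable(e[2]) else last(e[2])

def follow(e, table=None):
    """Map every position x of e to Follow(e, x)."""
    if table is None:
        table = {}
    tag = e[0]
    if tag == "sym":
        table.setdefault(e[1], set())
        return table
    for sub in e[1:]:
        follow(sub, table)
    if tag == "cat":      # what ends F can be followed by what starts G
        for x in last(e[1]):
            table[x] |= first(e[2])
    elif tag == "star":   # what ends F can be followed by what starts F
        for x in last(e[1]):
            table[x] |= first(e[1])
    return table

# Linearized expression of Example 1: x1* (x2 x3 + y4)*
E_lin = cat(star(sym("x1")), star(alt(cat(sym("x2"), sym("x3")), sym("y4"))))
print(first(E_lin), last(E_lin), follow(E_lin))
```

The states of the position automaton are then the positions plus the initial state 0, and its transitions can be read directly from `first` and `follow`.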



2.2 The Equation Automaton of a Regular Expression 

The notion of partial derivative of a regular expression is due to Antimirov.

Fig. 1. The position automaton for E = x*(xx + y)*.



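Definition 2 (Set of Partial Derivatives w.r.t. a Word). Given a regular expression $E$ and a word $u$, the set of partial derivatives of $E$ w.r.t. $u$, written $\partial_u(E)$, is recursively defined on the structure of $E$ as follows:

$$
\begin{aligned}
\partial_a(0) &= \emptyset = \partial_a(1) \\
\partial_a(x) &= \{1\} \text{ if } a = x,\ \emptyset \text{ otherwise} \\
\partial_a(F + G) &= \partial_a(F) \cup \partial_a(G) \\
\partial_a(F \cdot G) &= \begin{cases} \emptyset & \text{if } G = 0 \\ \partial_a(F) \cdot G & \text{if } G \neq 0 \text{ and } \lambda(F) = 0 \\ \partial_a(F) \cdot G \cup \partial_a(G) & \text{otherwise} \end{cases} \\
\partial_a(F^*) &= \partial_a(F) \cdot F^* \\
\partial_\varepsilon(E) &= \{E\} \quad \text{and} \quad \partial_{ua}(E) = \partial_a(\partial_u(E))
\end{aligned}
$$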

Example 2. Let $E = x^*(xx + y)^*$. We have: $\partial_x(E) = \{x^*(xx + y)^*,\ x(xx + y)^*\}$, $\partial_y(E) = \{(xx + y)^*\}$, $\partial_x(x(xx + y)^*) = \{(xx + y)^*\}$, $\partial_x((xx + y)^*) = \{x(xx + y)^*\}$, $\partial_y((xx + y)^*) = \{(xx + y)^*\}$.

Antimirov has proved that the cardinality of the set $\mathcal{PD}(E)$ of all the partial derivatives of a regular expression $E$ is less than or equal to $w + 1$. Hence the definition of $\mathcal{E}_E$, the equation automaton of $E$, whose states are the partial derivatives of $E$, and which recognizes $L(E)$.

Definition 3 (Equation Automaton). The equation automaton of a regular expression $E$, $\mathcal{E}_E = (Q, \Sigma, i, T, \delta)$, is defined by: $Q = \mathcal{PD}(E)$, $i = E$, $T = \{p \mid \lambda(p) = 1\}$ and $\delta(p, a) = \partial_a(p)$, $\forall p \in Q$, $\forall a \in \Sigma$.
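
Definitions 2 and 3 translate into a short worklist computation. The sketch below is an illustration under stated assumptions: the tuple encoding, the function names (`pd`, `equation_automaton`) and the simplification $1 \cdot G = G$ performed by `cat` are choices made for this example, and the constant 0 is not handled. It is a direct, unoptimized reading of the definitions, not the algorithm developed in this paper.

```python
ONE = ("one",)  # the expression 1 (empty word)

def sym(x): return ("sym", x)
def alt(f, g): return ("alt", f, g)
def star(f): return ("star", f)
def cat(f, g):
    # Concatenation with the simplification 1 . G = G
    return g if f == ONE else ("cat", f, g)

def nullable(e):
    tag = e[0]
    if tag in ("one", "star"):
        return True
    if tag == "sym":
        return False
    if tag == "alt":
        return nullable(e[1]) or nullable(e[2])
    return nullable(e[1]) and nullable(e[2])  # cat

def pd(e, a):
    """Set of partial derivatives of e w.r.t. the symbol a (Definition 2)."""
    tag = e[0]
    if tag == "one":
        return set()
    if tag == "sym":
        return {ONE} if e[1] == a else set()
    if tag == "alt":
        return pd(e[1], a) | pd(e[2], a)
    if tag == "star":
        return {cat(d, e) for d in pd(e[1], a)}
    left = {cat(d, e[2]) for d in pd(e[1], a)}  # cat
    return (left | pd(e[2], a)) if nullable(e[1]) else left

def equation_automaton(e, alphabet):
    """States, transitions and final states of Definition 3."""
    states, todo, delta = {e}, [e], {}
    while todo:
        p = todo.pop()
        for a in alphabet:
            targets = pd(p, a)
            if targets:
                delta[(p, a)] = targets
            for q in targets - states:
                states.add(q)
                todo.append(q)
    finals = {p for p in states if nullable(p)}
    return states, delta, finals

# E = x*(xx + y)* as in Example 2
G = star(alt(cat(sym("x"), sym("x")), sym("y")))
E = cat(star(sym("x")), G)
states, delta, finals = equation_automaton(E, "xy")
print(len(states), "states")
```

On this example the three states are $E$, $x(xx+y)^*$ and $(xx+y)^*$, in agreement with Example 2 and with the bound $w + 1$.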




Fig. 2. The equation automaton for E = x*(xx + y)*.



3 The C-Continuation Automaton of a Regular 
Expression 

The new algorithm we present to compute the equation automaton is based on the notion of c-derivative of a regular expression. In this section, we review the main properties of c-derivatives, and we recall the definition of the c-continuation automaton.

3.1 C-Derivatives

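Definition 4 (C-Derivative). The c-derivative $d_u(E)$ of a regular expression $E$ w.r.t. a word $u$ is recursively defined as follows:

$$
\begin{aligned}
d_a(0) &= 0 = d_a(1) \\
d_a(x) &= 1 \text{ if } a = x,\ 0 \text{ otherwise} \\
d_a(F + G) &= d_a(F) \text{ if } d_a(F) \neq 0,\ d_a(G) \text{ otherwise} \\
d_a(F \cdot G) &= d_a(F) \cdot G \text{ if } d_a(F) \neq 0,\ \lambda(F) \cdot d_a(G) \text{ otherwise} \\
d_a(F^*) &= d_a(F) \cdot F^* \\
d_\varepsilon(E) &= E \quad \text{and} \quad d_{u_1 \cdots u_n}(E) = d_{u_2 \cdots u_n}(d_{u_1}(E))
\end{aligned}
$$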

The two following propositions connect c-derivatives respectively to word deriva- 
tives and to partial derivatives. 

Proposition 1. Let $E$ be a linear expression, $a$ be any symbol in $\Sigma$, $u$ and $v$ be any words in $\Sigma^*$. The following properties hold:

(a) A non-null c-derivative of E is either 1 or a subexpression of E or a product 
of subexpressions of E. 

(b) $u^{-1}E \sim_{aci} v^{-1}E \Rightarrow d_u(E) = d_v(E)$

(c) The set of non-null c-derivatives $d_{ua}(E)$ of a linear expression $E$ reduces to a unique expression, the c-continuation of $a$ in $E$, denoted by $c_a$.

Proposition 2. Let $E$ be a regular expression over $\Sigma$, $\overline{E}$ be its linearized version and $h$ be the projection of $\mathrm{Pos}_E$ on $\Sigma$. The set of partial derivatives of $E$ w.r.t. $u$ is equal to the set of the projections on $\Sigma$ of the non-null c-derivatives of $\overline{E}$ w.r.t. words $v$ over $\mathrm{Pos}_E$ such that $h(v) = u$.

As a corollary of Proposition 2, the set $\mathcal{PD}(E)$ of all the partial derivatives of $E$ is equal to the set of the projections on $\Sigma$ of the c-continuations in $\overline{E}$.

Example 3. Let $E = x^*(xx + y)^*$. We have:

$\mathcal{PD}(E) = \{x^*(xx + y)^*,\ x(xx + y)^*,\ (xx + y)^*\}$.

The c-continuations in $\overline{E}$ are: $c_0 = \overline{E}$, $c_{x_1} = x_1^*(x_2 x_3 + y_4)^*$, $c_{x_2} = x_3(x_2 x_3 + y_4)^*$, $c_{x_3} = (x_2 x_3 + y_4)^*$ and $c_{y_4} = (x_2 x_3 + y_4)^*$.

We verify that: $\mathcal{PD}(E) = \{h(c_0), h(c_{x_1}), h(c_{x_2}), h(c_{x_3}), h(c_{y_4})\}$.
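
The c-continuations of Example 3 can be recomputed directly from Definition 4. The sketch below is illustrative: the tuple encoding, the names (`cderiv`, `c_continuations`, `project`) and the simplifications $0 \cdot G = 0$ and $1 \cdot G = G$ inside `cat` are assumptions of this example. It explores c-derivatives from $\overline{E}$ and records, for each position, the first non-null c-derivative found, relying on the uniqueness stated in Proposition 1(c).

```python
ZERO, ONE = ("zero",), ("one",)

def sym(x): return ("sym", x)
def alt(f, g): return ("alt", f, g)
def star(f): return ("star", f)
def cat(f, g):
    # Concatenation with the simplifications 0 . G = 0 and 1 . G = G
    if f == ZERO:
        return ZERO
    return g if f == ONE else ("cat", f, g)

def nullable(e):
    tag = e[0]
    if tag in ("one", "star"):
        return True
    if tag in ("zero", "sym"):
        return False
    if tag == "alt":
        return nullable(e[1]) or nullable(e[2])
    return nullable(e[1]) and nullable(e[2])  # cat

def cderiv(e, a):
    """c-derivative d_a(e) of Definition 4 (a single expression, possibly 0)."""
    tag = e[0]
    if tag in ("zero", "one"):
        return ZERO
    if tag == "sym":
        return ONE if e[1] == a else ZERO
    if tag == "alt":
        d = cderiv(e[1], a)
        return d if d != ZERO else cderiv(e[2], a)
    if tag == "star":
        return cat(cderiv(e[1], a), e)
    d = cderiv(e[1], a)  # cat
    if d != ZERO:
        return cat(d, e[2])
    return cderiv(e[2], a) if nullable(e[1]) else ZERO

def c_continuations(e, positions):
    """Map 0 and every reachable position a of the linear expression e to c_a."""
    cont, todo = {0: e}, [e]
    while todo:
        p = todo.pop()
        for a in positions:
            d = cderiv(p, a)
            if d != ZERO and a not in cont:
                cont[a] = d
                todo.append(d)
    return cont

def project(e, h):
    """Apply the projection h (positions -> symbols) to an expression."""
    if e[0] == "sym":
        return ("sym", h[e[1]])
    return (e[0],) + tuple(project(k, h) for k in e[1:])

G_lin = star(alt(cat(sym("x2"), sym("x3")), sym("y4")))
E_lin = cat(star(sym("x1")), G_lin)
h = {"x1": "x", "x2": "x", "x3": "x", "y4": "y"}
cont = c_continuations(E_lin, list(h))
print(len({project(c, h) for c in cont.values()}), "distinct projections")
```

The five c-continuations project onto three distinct expressions, which are the three partial derivatives of Example 3.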

3.2 The C-Continuation Automaton

Let $E$ be a regular expression. Proposition 1(c) leads to the definition of the non-deterministic automaton $\mathcal{C}_E$, called the c-continuation automaton of $E$. States are pairs $(x, c_x)$, where $x$ is in $\mathrm{Pos}_E \cup \{0\}$ and $c_x$ is the c-continuation of $x$ in $\overline{E}$. Transitions are deduced from the c-derivation of c-continuations $c_x$.

Definition 5 (C-Continuation Automaton). The c-continuation automaton of $E$, $\mathcal{C}_E = (Q, \Sigma, i, T, \delta)$, is defined by: $Q = \{(x, c_x) \mid x \in \mathrm{Pos}_E \cup \{0\}\}$, $i = (0, c_0)$, $T = \{(x, c_x) \mid \lambda(c_x) = 1\}$, $\delta((x, c_x), a) = \{(y, c_y) \mid h(y) = a \text{ and } d_y(c_x) = c_y\}$, $\forall x \in \mathrm{Pos}_E \cup \{0\}$ and $\forall a \in \Sigma$.

Fig. 3. The c-continuation automaton for E = x*(xx + y)*.



The c-continuation automaton $\mathcal{C}_E$ and the position automaton $\mathcal{P}_E$ are isomorphic. This property is a corollary of Berry and Sethi's results, since a c-continuation is a particular continuation. The proof can also be directly deduced from the following proposition:

Proposition 3. Let $E$ be a regular expression. The following equalities hold:

1. $\mathrm{First}(E) = \{y \in \mathrm{Pos}_E \mid d_y(\overline{E}) \neq 0\}$;

2. $\mathrm{Last}(E) = \{y \in \mathrm{Pos}_E \mid \lambda(c_y(\overline{E})) = 1\}$;

3. $\mathrm{Follow}(E, x) = \{y \in \mathrm{Pos}_E \mid d_y(c_x(\overline{E})) \neq 0\}$.

3.3 The Quotient of the C-Continuation Automaton

Let $\sim$ be the equivalence relation on the set of states of $\mathcal{C}_E$, defined by: $(x, c_x) \sim (y, c_y) \iff h(c_x) = h(c_y)$. This relation is proved to be right-invariant, thus the quotient automaton $\mathcal{C}_E/{\sim} = (Q_\sim, \Sigma, i, T, \delta)$ is defined as follows:

Definition 6 (Quotient Automaton). $Q_\sim = \{[c_x] \mid x \in \mathrm{Pos}_E \cup \{0\}\}$, $i = [c_0]$, $T = \{[c_x] \mid \lambda(c_x) = 1\}$, and $[c_y] \in \delta([c_x], a) \iff \exists c_z \in [c_y]$ such that $h(z) = a$ and $d_z(c_x) = c_z$, $\forall [c_x], [c_y] \in Q_\sim$ and $\forall a \in \Sigma$.





Fig. 4. $\mathcal{C}_E$ and $\mathcal{C}_E/{\sim}$ for E = x*(xx + y)*.

From Proposition 2 it follows that the quotient automaton $\mathcal{C}_E/{\sim}$ and the equation automaton $\mathcal{E}_E$ are identical. This result leads to a new construction of the equation automaton.
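
On the running example the quotient can be carried out by hand, or with a few lines of code. The sketch below is an illustration only: it takes the c-continuations of Example 3 and the Follow sets of Example 1 as literal strings, and the projection `h`, which simply strips position indices, assumes that positions are written as a letter followed by an index.

```python
import re
from collections import defaultdict

# c-continuations of Example 3, written as plain strings
c = {
    "0": "x1*(x2x3+y4)*",
    "x1": "x1*(x2x3+y4)*",
    "x2": "x3(x2x3+y4)*",
    "x3": "(x2x3+y4)*",
    "y4": "(x2x3+y4)*",
}
# Follow sets of Example 1, with the targets of state 0 given by First(E)
follow = {
    "0": ["x1", "x2", "y4"],
    "x1": ["x1", "x2", "y4"],
    "x2": ["x3"],
    "x3": ["x2", "y4"],
    "y4": ["x2", "y4"],
}

def h(text):
    """Projection on the alphabet: drop position indices (x2 -> x)."""
    return re.sub(r"\d+", "", text)

# Each state (x, c_x) is sent to its class, represented by h(c_x).
state_of = {x: h(cx) for x, cx in c.items()}

# Transitions of the quotient: merge the transitions of equivalent states.
delta = defaultdict(set)
for x, targets in follow.items():
    for y in targets:
        delta[(state_of[x], h(y))].add(state_of[y])

for (p, a), qs in sorted(delta.items()):
    print(p, a, sorted(qs))
```

The resulting three states and five labelled transition sets coincide with the partial derivatives computed in Example 2.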

In the two following sections, we first show how to implement this new con- 
struction with a cubic time and space complexity. Then we present algorithmic 
refinements which lead to a quadratic time and space complexity, thus consid- 
erably improving Antimirov’s algorithm. 



4 Computation of the Set of States of $\mathcal{C}_E/{\sim}$

We first describe an $O(s^3)$ explicit computation of the list of the c-continuations of $E$. Then we introduce the notion of pseudo c-continuation and show that it leads to an $O(s^2)$ computation of the set of states of $\mathcal{C}_E/{\sim}$.



4.1 Computing the List of the C-Continuations

As a consequence of Definition 4, the following property holds:

Proposition 4. The c-derivative $d_u(E)$ of a linear expression $E$ w.r.t. a word $u$ of $\Sigma^+$ is either $0$ or such that:

$$
\begin{aligned}
d_u(u) &= 1 \\
d_u(F + G) &= d_u(F) \text{ if } d_u(F) \neq 0,\ d_u(G) \text{ otherwise} \\
d_u(F \cdot G) &= \begin{cases} d_u(F) \cdot G & \text{if } d_u(F) \neq 0 \\ d_s(G) & \text{otherwise } (s \neq \varepsilon \text{ is some suffix of } u) \end{cases} \\
d_u(F^*) &= d_s(F) \cdot F^* \quad (s \neq \varepsilon \text{ is some suffix of } u)
\end{aligned}
$$

This property implies that a c-continuation $c_a(E)$ in a linear expression $E$ is a product of distinct subexpressions $H_i$ of $E$, possibly reduced to a single subexpression or to $1$. We now show how this product can be computed over the syntax tree $T(E)$ of $E$.

Proposition 5. Let $F$ and $G$ be subexpressions of the linear expression $E$. Let $f$ be the mapping such that:

$$
f(F) = \begin{cases} F^* & \text{if } F \text{ is a son of } F^* \text{ in } T(E) \\ G & \text{if } F \text{ is a son of } F \cdot G \text{ in } T(E) \\ 1 & \text{otherwise} \end{cases}
$$

Let $\bigodot$ denote the concatenation of a list of expressions. We write $F \prec G$ if and only if $G$ is an ancestor of $F$. Then for every symbol $a$ in $\Sigma$, we have:

$$
c_a(E) = \bigodot_{a \preceq H \prec E} f(H)
$$

The proof is by induction on the number of operators in $E$. The complexity of the computation of the c-continuations over $T(E)$ is given by the following proposition.

Proposition 6. Let $E$ be a linear expression of size $s$ and alphabetic width $w$. The list of the c-continuations in $E$ can be computed with an $O(ws^2)$ space and time complexity. If $E$ is starfree, the complexity is $O(ws)$.

The proof relies on the fact that the size of a c-continuation is bounded by $s^2$. Moreover, the algorithm of Aho et al. or the refinement of Paige and Tarjan allows sorting a list of strings in lexicographic order with a time complexity $O(a + k)$, where $a$ is the total sum of the sizes of the strings and $k$ is the size of the alphabet. We deduce an $O(ws^2)$ space and time procedure to identify equivalent c-continuations. Hence the sets of states of $\mathcal{C}_E$ and of $\mathcal{C}_E/{\sim}$ can be produced in $O(ws^2)$ space and time.
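
The product of Proposition 5 amounts to a walk from a leaf of $T(E)$ up to the root, collecting $f(H)$ at every step. The following Python sketch illustrates that walk; the `Node` class, the helpers `link`, `c_continuation` and `show`, and the minimal pretty-printer are assumptions made for this example and are not the authors' implementation.

```python
class Node:
    """Syntax-tree node with a parent pointer (hypothetical encoding)."""
    def __init__(self, tag, *kids, name=None):
        self.tag, self.kids, self.name, self.parent = tag, kids, name, None
        for k in kids:
            k.parent = self

def link(node):
    """f(H) of Proposition 5; None stands for the neutral factor 1."""
    p = node.parent
    if p is None:
        return None
    if p.tag == "star":
        return p                      # H is the son of H*
    if p.tag == "cat" and p.kids[0] is node:
        return p.kids[1]              # H is the left son of H . G
    return None

def c_continuation(leaf):
    """Factors of c_a, collected bottom-up from the leaf a to the root."""
    factors, h = [], leaf
    while h.parent is not None:
        f = link(h)
        if f is not None:
            factors.append(f)
        h = h.parent
    return factors

def show(n):
    """Minimal printer: parenthesizes sums, and products under a star."""
    if n.tag == "sym":
        return n.name
    parts = [("(" + show(k) + ")") if k.tag == "alt" and n.tag != "alt" else show(k)
             for k in n.kids]
    if n.tag == "star":
        inner = parts[0] if n.kids[0].tag in ("sym", "alt") else "(" + parts[0] + ")"
        return inner + "*"
    return ("+" if n.tag == "alt" else "").join(parts)

def show_product(factors):
    return "".join(("(" + show(f) + ")") if f.tag == "alt" else show(f)
                   for f in factors) or "1"

x1, x2, x3, y4 = (Node("sym", name=n) for n in ("x1", "x2", "x3", "y4"))
root = Node("cat", Node("star", x1),
            Node("star", Node("alt", Node("cat", x2, x3), y4)))
for leaf in (x1, x2, x3, y4):
    print(leaf.name, show_product(c_continuation(leaf)))
```

On the linearized expression of Example 3 this prints the four c-continuations $c_{x_1}$, $c_{x_2}$, $c_{x_3}$ and $c_{y_4}$ listed there.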



4.2 Preprocessing of Starred Subexpressions 

According to Proposition 6, the computation of the list of the c-continuations is only $O(ws)$ when $E$ is starfree. We therefore preprocess starred subexpressions of $E$, and we compute the list of the c-continuations in the resulting starfree expression $E'$. Notice that the expression $E'$ is not explicitly computed: the aim is to substitute star-names for the starred subexpressions involved in a c-continuation, and it can be achieved by making use of the labels of star-links. Hence the procedure:

Procedure Stars(E: regular expression)

Step 1: Process a top-down, left-to-right traversal of T(E), and store the starred subexpressions in a list L.

Step 2: Sort L in lexicographic order and associate the same star-name to identical strings.

Step 3: Mark star-links (s.t. f(F) = F*) by the star-name of F*.


The complexity of this preprocessing is as follows: 

Proposition 7. Let $E$ be a linear expression of size $s$ and alphabetic width $w$. The procedure Stars(E) preprocesses the starred subexpressions of $T(E)$ with an $O(s^2)$ space and time complexity.



4.3 Computing the List of the Pseudo-Continuations 

We call pseudo-continuation w.r.t. a position $x$ the string $l_x$ which is deduced from the c-continuation $c_x$ by substituting each starred subexpression by its star-name. More formally, let us denote by $S(H)$ the star-name of the star expression $H$, and by $s(H)$ the string associated to the expression $H$. We have the following definition:

Definition 7 (Pseudo-Continuation). Let $E$ be a linear expression and $c_x$ be the c-continuation w.r.t. $x$ in $E$. The pseudo-continuation w.r.t. $x$ in $E$ is the string $l_x$ such that:

$$
\begin{aligned}
c_x &= H_1 \cdot H_2 \cdots H_l, \quad \text{where } H_i \text{ is a subexpression of } E,\ 1 \le i \le l \\
l_x &= A_1 \cdot A_2 \cdots A_l \\
A_i &= \begin{cases} S(H_i) & \text{if } H_i \text{ is a star expression} \\ s(H_i) & \text{otherwise} \end{cases}
\end{aligned}
$$

We first show that pseudo-continuations can be substituted for c-continuations inside the identification process. Let $S_*$ be the alphabet of the star-names. A pseudo-continuation $l_x$ is a string over the alphabet $\mathrm{Pos}_E \cup S_* \cup \{0, 1, +, \cdot\}$. Let us extend $h$ to $\mathrm{Pos}_E \cup S_*$ by setting $h(s) = s$ for all $s$ in $S_*$.

Proposition 8. Let $x$ and $y$ be in $\mathrm{Pos}_E \cup \{0\}$. Then we have:

$$
h(l_x) = h(l_y) \iff h(c_x) = h(c_y)
$$

Proof. ($\Rightarrow$) Obvious: identical star-names inside strings $h(l_x)$ and $h(l_y)$ necessarily have the same expansion in $h(c_x)$ and $h(c_y)$.

($\Leftarrow$) This is due to the fact that it is syntactically impossible for subexpressions $F$, $G$ and $H$ to verify $H^* = F^* \cdot G^*$.



Finally, the list of the pseudo-continuations $l_x$ in $E$ and the set of their projections on $\Sigma$ can be computed by the following procedure:

Procedure Pseudocontinuations(E: regular expression)

Step 1: Compute the set of links f(F) in T(E).

Step 2: Perform the procedure Stars(E).

Step 3: For each position x of E, construct l_x.

Step 4: Sort the projections of the pseudo-continuations.
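
The effect of the two procedures can be illustrated on the running example. The sketch below rests on several assumptions that are not in the paper: the c-continuations of Example 3 are written by hand as lists of factors flagged as starred or not, star-names are the invented labels `S1`, `S2`, ..., star-names are attached to the projections of the starred subexpressions, and Python's built-in comparison sort replaces the linear-time lexicographic sort.

```python
import re
from collections import defaultdict

def h(text):
    """Projection on the alphabet: drop position indices (x2 -> x)."""
    return re.sub(r"\d+", "", text)

def star_names(starred):
    """Sort the starred strings and give identical strings the same name."""
    names = {}
    for text in sorted(starred):
        if text not in names:
            names[text] = "S" + str(len(names) + 1)
    return names

def projected_pseudo_continuation(factors, names):
    """factors: list of (text, is_star) describing c_x = H_1 . H_2 ... H_l."""
    return ".".join(names[h(t)] if is_star else h(t) for t, is_star in factors)

G = "(x2x3+y4)*"
# c-continuations of Example 3, decomposed by hand into factors
c = {
    "0": [("x1*", True), (G, True)],
    "x1": [("x1*", True), (G, True)],
    "x2": [("x3", False), (G, True)],
    "x3": [(G, True)],
    "y4": [(G, True)],
}
names = star_names(h(t) for fs in c.values() for t, is_star in fs if is_star)

# Identify positions whose projected pseudo-continuations are equal.
states = defaultdict(list)
for x, fs in c.items():
    states[projected_pseudo_continuation(fs, names)].append(x)
print(dict(states))
```

The grouping obtained on these short strings is the same as the one obtained from the full c-continuations, as Proposition 8 predicts.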



The complexity of this procedure is deduced from Proposition 7 and Lemma 1.

Lemma 1. Let $E$ be a regular expression of size $s$ and alphabetic width $w$. Then the alphabet of the pseudo-continuations in $E$ has an $O(w + s)$ size. Pseudo-continuations have an $O(s)$ size.

Proposition 9. Let $E$ be a regular expression of size $s$ and alphabetic width $w$. The procedure Pseudocontinuations(E) computes the list of the pseudo-continuations in $E$ and the set of their projections on $\Sigma$ with an $O(s^2)$ space and time complexity.

As a consequence, the set of states of the quotient automaton $\mathcal{C}_E/{\sim}$ can be computed with an $O(s^2)$ space and time complexity.



5 Computation of the Set of Transitions of $\mathcal{C}_E/{\sim}$

According to Proposition 3, the set of transitions in $\mathcal{C}_E$ and consequently the set of transitions in $\mathcal{C}_E/{\sim}$ can be deduced from the position sets of $E$. Let us point out that the structure and the properties of c-continuations provide a proof for some computational refinements such as the disjoint set unions used to build the ZPC structure.

5.1 Transitions in $\mathcal{C}_E$

Let $(x, c_x)$ be a state of $\mathcal{C}_E$. We consider the set $T_{(x,c_x)}$ of the positions associated to the targets of the out-going transitions of $(x, c_x)$: $T_{(x,c_x)} = \{y \mid y \in \mathrm{Pos}_E \text{ and } d_y(c_x) \neq 0\}$.

Let $c_x = H_1 \cdot H_2 \cdots H_l$ be the decomposition of $c_x$ as a product of subexpressions of $E$. The set $T_{(x,c_x)}$ can be deduced from the First sets of the subexpressions $H_i$ as follows:

Proposition 10. Consider the integer $s$, $1 \le s \le l$, such that $\lambda(H_i) = 1$ for all $1 \le i < s$ and $\lambda(H_s) = 0$. Assume $s = l$ if $\lambda(H_i) = 1$ for all $1 \le i \le l$. Then we have:

$$
T_{(x,c_x)} = \bigcup_{1 \le i \le s} \mathrm{First}(H_i)
$$

The proof is based on Proposition 3.
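
Proposition 10 can be read as a scan of the factors that stops at the first non-nullable one. The sketch below illustrates this on the running example; the function name `targets` and the encoding of each factor $H_i$ as a pair ($\mathrm{First}(H_i)$, $\lambda(H_i)$) are assumptions of this example, and the First sets are entered by hand.

```python
def targets(factors):
    """Union of First(H_i) for i = 1..s, where H_s is the first non-nullable factor.

    factors: list of (First(H_i), lambda(H_i)) for c_x = H_1 . H_2 ... H_l.
    """
    result = set()
    for first_set, is_nullable in factors:
        result |= first_set
        if not is_nullable:   # this factor is H_s: later factors are unreachable
            break
    return result

first_G = {"x2", "y4"}        # First((x2 x3 + y4)*)
c = {
    "x1": [({"x1"}, True), (first_G, True)],   # c_x1 = x1* . (x2 x3 + y4)*
    "x2": [({"x3"}, False), (first_G, True)],  # c_x2 = x3 . (x2 x3 + y4)*
    "x3": [(first_G, True)],                   # c_x3 = (x2 x3 + y4)*
}
for x, factors in c.items():
    print(x, sorted(targets(factors)))
```

The sets obtained are the Follow sets of Example 1, as expected from Proposition 3.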

We now show that the set $T_{(x,c_x)}$ can be computed as a disjoint union of First sets. This fact is based on the following property of c-continuations.

Lemma 2. Let $c_x = H_1 \cdot H_2 \cdots H_l$. Let $r$ be an integer such that $1 \le r \le l$, and $H_r$ be a star-link such that $H_r = K^* = f(K)$. Then there exists a word $u$ such that: $c_x = d_u(K) \cdot K^* \cdot H_{r+1} \cdots H_l$.

Proposition 11. Consider $s$ such that $\lambda(H_i) = 1$ for all $1 \le i < s$ and $\lambda(H_s) = 0$ (and $s = l$ if $\lambda(H_i) = 1$ for all $1 \le i \le l$). Let $r$, $1 \le r \le s$, be the greatest index such that $H_r$ is a star-link. We assume $r = 1$ if there is no star subexpression. Then, we have:

$$
T_{(x,c_x)} = \biguplus_{r \le i \le s} \mathrm{First}(H_i)
$$

Proof. By Proposition 10, for every state $(x, c_x)$ of $\mathcal{C}_E$, $T_{(x,c_x)} = \bigcup_{1 \le i \le s} \mathrm{First}(H_i)$. By Lemma 2, since $H_r$ is a star-link, there exists a word $u$ such that: $c_x = d_u(K) \cdot K^* \cdot H_{r+1} \cdots H_l$. Since $\mathrm{First}(d_u(K)) \subseteq \mathrm{First}(K^*)$, we get $T_{(x,c_x)} = \bigcup_{r \le i \le s} \mathrm{First}(H_i)$. Moreover, since $H_r$ is the "highest" star-link in $T_{c_x}$, this last union is a disjoint one. ■

Since the collection of the First sets of the subexpressions of $E$ can be computed with an overall $O(s)$ time complexity, via linkings over the syntax tree of $E$, as designed in the construction of the ZPC structure, we get an $O(s^2)$ space and time computation of the set of transitions of $\mathcal{C}_E$.

5.2 Transitions in $\mathcal{C}_E/{\sim}$

Let $[(x, c_x)]$ be a state of $\mathcal{C}_E/{\sim}$. We consider the set $T_{[c_x]}$ of pairs $(a, [c_y])$ such that $([(x, c_x)], a, [(y, c_y)])$ is a transition.

$$
T_{[c_x]} = \{(a, [c_y]) \mid y \in \mathrm{Pos}_E \wedge d_y(c_x) \neq 0 \wedge a = h(y)\}
$$

The following proposition is deduced from Proposition 11.

Proposition 12. Consider $s$ such that $\lambda(H_i) = 1$ for all $1 \le i < s$ and $\lambda(H_s) = 0$ (and $s = l$ if $\lambda(H_i) = 1$ for all $1 \le i \le l$). Let $r$, $1 \le r \le s$, be the greatest index such that $H_r$ is a star-link. We assume $r = 1$ if there is no star subexpression. Then, we have:

$$
T_{[c_x]} = \biguplus_{r \le i \le s} \{(a, [c_y]) \mid y \in \mathrm{First}(H_i) \wedge a = h(y)\}
$$

We thus get an $O(s^2)$ space and time computation of the set of transitions of $\mathcal{C}_E/{\sim}$.



6 Conclusion: Comparison with Antimirov’s Construction 

In Antimirov's algorithm, the successive derivations produce $O(s^2)$ expressions, each of which is compared to $O(s)$ distinct partial derivatives. Therefore there are $O(s^3)$ expression tests. Since the size of a partial derivative is $O(s^2)$, the overall complexity is $O(s^5)$. This complexity is improved by an $O(s^3)$ factor by the computation of the quotient of the c-continuation automaton for the following reasons:

(1) Only the computation of the set of states generates expression tests. The computation of the set of transitions is deduced from $O(s^2)$ procedures used in the construction of the position automaton.

(2) Each c-continuation can be directly produced over the syntax tree in $O(s^2)$ time. The set of the projections of the c-continuations can be computed in $O(s^3)$ time (each projection is compared only to one other expression after a lexicographic sort). Hence an $O(s^3)$ construction of the set of states.

(3) This construction can be refined by substituting pseudo-continuations for c-continuations. The size of a pseudo-continuation is $O(s)$. The computation of the set of pseudo-continuations, which implies a preprocessing of the starred subexpressions, is in $O(s^2)$ time. Hence an $O(s^2)$ construction of the set of states.



Abstract. A Compact Directed Acyclic Word Graph (CDAWG) is a space-efficient text indexing structure that can be used in several different string algorithms, especially in the analysis of biological sequences. In this paper, we present a new on-line algorithm for its construction, as well as the construction of a CDAWG for a set of strings.


1 Introduction 

Several different string problems, like those deriving from the analysis of biological sequences, can be solved efficiently with a suitable text-indexing structure. Perhaps the most widely used and known structure of this kind is the suffix tree, which can be built in linear time and makes it possible to efficiently find and locate all the substrings of a given string. The main drawback of suffix trees is the additional space required to implement the structure. In many applications, like sequence analysis and pattern discovery in biological sequences, keeping as much data as possible in main memory might provide significant advantages. This fact has led to the introduction of more space-efficient structures, like suffix arrays, suffix cacti, and others.

In this work, we focus our attention on the Compact Directed Acyclic Word Graph (CDAWG). The CDAWG for a string can be seen either as a compaction of the Directed Acyclic Word Graph (DAWG), or as a minimization of the suffix tree, and it can be derived from either structure. In the latter case, the basic idea is to merge redundant parts of the suffix tree (see Fig. 1). Experimental results have shown how CDAWGs provide significant reductions of the memory space required by suffix trees and DAWGs when applied to genomic sequences. A linear time algorithm for the direct construction of the CDAWG of a string has also been presented, so as to avoid the additional space required by the preliminary construction of
* The results described in this work were reached independently by the Kyushu and 
Milan groups, submitted simultaneously to the conference, and merged into a joint 
contribution.

Fig. 1. Suffix tree and CDAWG for string cocoa. Substrings co and o occur as prefixes of the same suffixes: the corresponding nodes are merged as well as the subtrees rooted at the nodes. Leaves are merged into a single final node.



the DAWG or the suffix tree. The algorithm is similar to McCreight's algorithm for suffix trees. In this paper, we present a new algorithm for the construction of CDAWGs, based on Ukkonen's algorithm for suffix trees. The algorithm is on-line, that is, it processes the characters of the string from left to right one by one, with no need to know the whole string beforehand. Furthermore, we show how the algorithm can be used to build a CDAWG for a set of strings, a structure that was first derived by compacting a DAWG for a set of strings. The main drawback of this approach was the fact that, when a new string was added to the set, the DAWG had to be built again from scratch. Instead, the algorithm we present allows a new string to be added directly to the compact structure.

2 Definitions 

Let $\Sigma$ be a nonempty finite alphabet, and $\Sigma^*$ the set of strings over $\Sigma$. If $s = \alpha\beta\gamma$, with $\alpha, \beta, \gamma \in \Sigma^*$, then $\alpha$ is a prefix of $s$, $\gamma$ is a suffix of $s$, and $\alpha$, $\beta$, and $\gamma$ are substrings (factors) of $s$. If $s = s_1 \ldots s_n$ is a string in $\Sigma^*$, $|s|$ denotes its length, and $s[i..j]$ its substring $s_i \ldots s_j$. With $Suf(s)$ we will denote the set of all suffixes of $s$. Let $X$ be a subset of $\Sigma^*$. For any string $u \in \Sigma^*$, $u^{-1}X = \{x \mid ux \in X\}$. Given a string $s$, we define the syntactic congruence on $\Sigma^*$ associated with $Suf(s)$ and denoted by $\equiv_{Suf(s)}$ as:

$$
u \equiv_{Suf(s)} v \iff u^{-1}Suf(s) = v^{-1}Suf(s) \quad (\text{for any } u, v \in \Sigma^*)
$$

That is, $u$ and $v$ occur as prefixes of the same suffixes of $s$. In other words, the occurrences of $u$ and $v$ must end at the same positions in the string. Hence, if $u$ and $v$ occur in the string, one must be a suffix of the other. We will call classes of factors the congruence classes of the relation $\equiv_{Suf(s)}$. The class of all strings that are not substrings of $s$ is called the degenerate class. The longest string in a non-degenerate class of factors is the representative of the class. Given a non-degenerate class of factors $C$ of $\equiv_{Suf(s)}$ and its representative $u$, if there are at least two characters $a, b \in \Sigma$ such that $ua$ and $ub$ are substrings of $s$, then $C$ is a strict class of factors of $\equiv_{Suf(s)}$. From now on, we will say that two substrings are strictly congruent if they belong to the same strict class of factors. We are now ready to give a formal definition of a CDAWG.

Fig. 2. Implicit CDAWG and CDAWG for string abcab.

Definition 1. The compact directed acyclic word graph (CDAWG) of a string 
s is a directed acyclic graph, where: 

1. two distinct nodes are marked as initial and final; 

2. edges are labeled with nonempty substrings of s;

3. labels of two edges leaving the same node cannot begin with the same char- 
acter; 

4. every suffix of s corresponds to a path on the graph starting from the initial
node and ending at a node, such that the concatenation of the edge labels on 
the path exactly spells the suffix. From now on, we will call a node corre- 
sponding to a suffix of s terminal node; 

5. substrings spelled by paths starting from the initial node and ending at the 
same non-terminal node of the graph belong to the same strict class of fac- 
tors. 
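
The classes of factors underlying this definition can be enumerated naively for a short string. The sketch below is only an illustration: the function names `classes_of_factors` and `strict_classes` are assumptions of this example, and the enumeration takes time polynomial in $|s|$, far from the linear-time construction discussed in this paper. It groups the substrings of $s$ by their set $u^{-1}Suf(s)$ and keeps the classes whose representative is followed by at least two distinct characters.

```python
from collections import defaultdict

def classes_of_factors(s):
    """Non-degenerate classes of the congruence associated with Suf(s)."""
    n = len(s)
    suffixes = {s[i:] for i in range(n + 1)}
    substrings = {s[i:j] for i in range(n + 1) for j in range(i, n + 1)}
    groups = defaultdict(set)
    for u in substrings:
        # u^{-1} Suf(s): what remains of each suffix that starts with u
        quotient = frozenset(x[len(u):] for x in suffixes if x.startswith(u))
        groups[quotient].add(u)
    return list(groups.values())

def strict_classes(s):
    """Classes whose representative u has two characters a, b with ua, ub in s."""
    result = []
    for cls in classes_of_factors(s):
        rep = max(cls, key=len)  # the representative is the longest string
        followers = {s[i + len(rep)]
                     for i in range(len(s) - len(rep)) if s.startswith(rep, i)}
        if len(followers) >= 2:
            result.append(cls)
    return result

for cls in strict_classes("cocoa"):
    print(sorted(cls, key=len))
```

For the string cocoa, the substrings co and o fall in the same strict class, which is the merge described in the caption of Fig. 1; the only other strict class is that of the empty string.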

The CDAWG of a string $s$ has at most $|s| + 1$ nodes and $2|s| - 2$ edges. According to the definition of a strict class of factors, non-terminal nodes must have at least two outgoing edges. We will denote with $(p, \alpha, q)$ the edge $p \rightarrow q$ of the graph labeled with substring $\alpha$. The following definitions will be useful throughout the paper:

Definition 2. The implicit CDAWG of a string s is a CDAWG where nodes 
with outdegree one are removed, and each edge entering a node with outdegree 
one is merged with the edge leaving it. 

In the implicit CDAWG of a string $s$, the suffixes of $s$ are spelled out by paths in the graph starting at the initial node, but not necessarily ending at a node. An example is shown in Fig. 2. For every node $p$, let $length_s(p)$ be the length of the longest substring spelled by a path from the initial node to $p$. Edges belonging to the spanning tree of the longest paths from the initial node are called solid edges. In other words, an edge $(p, \alpha, q)$ is solid iff $length_s(q) = length_s(p) + |\alpha|$. Finally, we assume that the label of each edge is implemented with a pair of integers denoting the starting and ending points in the string of the substring corresponding to the label, and every node is annotated with the length of the longest path from the initial node.

3 Construction of the CDAWG for a Single String 

Given an alphabet $\Sigma$, let $s = s_1 \ldots s_n$ be a string over $\Sigma$. Our algorithm is divided into $n$ phases, building at each phase $i$ the implicit CDAWG $\mathcal{G}_i$ for the prefix $s[1..i]$ of $s$. In more detail, the implicit CDAWG $\mathcal{G}_{i+1}$ for $s[1..i+1]$ is constructed starting from graph $\mathcal{G}_i$ for $s[1..i]$. Each phase $i + 1$ is divided into $i + 1$ extensions, one for each of the $i + 1$ suffixes of $s[1..i+1]$. In extension $j$ of phase $i + 1$, the algorithm finds the end of the path from the initial node labeled with substring $s[j..i]$, and extends it by adding character $s_{i+1}$ to the path, unless it is already there. Therefore, in phase $i + 1$, substring $s[1..i+1]$ is first put on the graph, followed by $s[2..i+1]$, $s[3..i+1]$, and so on. Extension $i + 1$ of phase $i + 1$ adds the single character $s_{i+1}$ after the initial node. The initial graph $\mathcal{G}_1$ has one initial node $I$ and one final node $F$, connected by an edge labeled by character $s_1$. The algorithm can be sketched as follows:

Construct graph G_1
For i from 1 to n - 1 do
    For j from 1 to i + 1 do
        Find the end of the path from I labeled s[j..i]
        Add character s_{i+1} if needed
    End for
End for
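
The phase and extension scheme can be tried out on a much simpler structure than the CDAWG. The Python sketch below deliberately swaps in an uncompacted suffix trie (nested dictionaries, one character per edge) so that only the two nested loops are illustrated: there is no final node, no edge splitting and no class handling. The function names are assumptions of this example, and indices are 0-based.

```python
def build_suffix_trie_online(s):
    """Run the phases and extensions of the sketch on a plain suffix trie.

    Returns the trie and the number of extensions in which the character
    was already present, so that nothing had to be added.
    """
    root, skipped = {}, 0
    if not s:
        return root, skipped
    root[s[0]] = {}                      # first phase: a single edge s[0]
    for i in range(1, len(s)):           # phase adding character s[i]
        for j in range(i + 1):           # one extension per suffix s[j:i+1]
            node = root
            for ch in s[j:i]:            # find the end of the path s[j:i]
                node = node[ch]
            if s[i] in node:
                skipped += 1             # already there: do nothing
            else:
                node[s[i]] = {}          # add the new character
    return root, skipped

def count_nodes(node):
    """Number of non-root trie nodes, i.e. of distinct nonempty substrings."""
    return sum(1 + count_nodes(child) for child in node.values())

trie, skipped = build_suffix_trie_online("abcab")
print(count_nodes(trie), skipped)
```

Walking each path from the root makes this sketch cubic in the worst case, which is the naive cost that the suffix links of the CDAWG construction are meant to avoid.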

At extension $j$ of phase $i + 1$, once the end of the path spelling $s[j..i]$ has been located, the CDAWG can be updated according to three different rules:

1. In the current graph, the path spelling $s[j..i]$ ends in $F$. To update the graph, character $s_{i+1}$ is appended to the label of the edge entering $F$.

2. The path corresponding to $s[j..i]$ does not continue with $s_{i+1}$, but continues with at least one character $c$. If the path ends at a node $p$, we create a new edge $(p, s_{i+1}, F)$. Otherwise, we create a new node $q$ at the end of the path, splitting the edge in two at the point where the path ends. Then, we create a new edge $(q, s_{i+1}, F)$.

3. Some path at the end of $s[j..i]$ continues with $s_{i+1}$. In this case, substring $s[j..i+1]$ is already in the current graph: we do nothing (hence the implicit graph).

These rules, however, do not guarantee that at the end of the phase we have correctly constructed a CDAWG. In fact, the algorithm must also check whether a substring strictly congruent to another one has been encountered, or, conversely, whether a substring has to be removed from a strict class of factors, so that at the end of phase $i + 1$ paths ending at the same node correspond to strict classes of factors of $\equiv_{Suf(s[1..i+1])}$ and vice versa. Here we sketch how the algorithm has to be modified. A more detailed description of the algorithm and its implementation can be found elsewhere.

Fig. 3. Implicit CDAWG for string abcaba before (left) and after redirection of an edge, at phase 6, extension 5. Node 1, labeled ab, was created at the previous extension, after the insertion of a at the end of the path labeled ab. Now, the path corresponding to b is found ending in the middle of non-solid edge (I, bcaba, F), that is redirected to node 1 and becomes (I, b, 1).



Detecting Strictly Congruent Factors. Two substrings $\alpha$ and $\beta$ belong to the same class $C$ iff they are prefixes of the same suffixes, and there are at least two characters $a, b \in \Sigma$ such that $\alpha a$, $\alpha b$, $\beta a$, and $\beta b$ occur in $s$. Moreover, $\alpha$ must be a suffix of $\beta$, or vice versa. We suppose w.l.o.g. that $\alpha = c\beta$, with $c \in \Sigma$. We also assume that $\alpha$ and $\beta$ have occurred just once, that substrings $\alpha a$ and $\beta a$ have been put in the graph in some previous phase (in two consecutive extensions), and in the current extension we have to insert $\alpha b$. The path spelling $\alpha$ ends in the middle of an edge, and the next character on the edge is $a$. A new node $p$ is created at the end of the path, as well as a new edge $(p, b, F)$. At the following extension, we have to locate $\beta$ in the graph. If $\beta$ has occurred only once (together with $\alpha$), it now belongs to the same strict class of factors, and we end in the middle of a non-solid edge that continues with $a$. In this case, we redirect the edge to $p$, labeling it with the part of the label that was contained in the path of $\beta$ (see Fig. 3). Since there can be more than two consecutive substrings to be assigned to the same class, it is possible that we again end along non-solid edges in the following extensions. In this case, we redirect the non-solid edges to $p$ as well, until we reach an extension where we end at a node or along a solid edge. Otherwise, if $\beta$ had previously occurred also by itself, either the path corresponding to $\beta$ ends at a node ($\beta$ has been followed by characters different from $a$), or the edge we end on is solid ($\beta$ had been followed only by $a$). In the former case, if there is not an edge labeled $b$ leaving the node we create a new edge labeled $b$ to the final node. In the latter case, we create a new node and connect it to the final node with an edge labeled $b$. Then, there may be again non-solid edges that have to be redirected into the newly created node.

Fig. 4. CDAWG for string abcabdb at phase 7, extension 7. Character b is found at the end of the non-solid edge (I, b, 1). At extension 6, the path spelling db ended at the final node. Thus, b has to be removed from the class associated with node 1, that is cloned into node 2. Edge (I, b, 1) becomes (I, b, 2).



Splitting a Strict Class of Factors. Conversely, a substring that has been assigned to a strict class of factors has to be removed from the class if it does not occur as a suffix of the representative when a new character $s_{i+1}$ is added to the string. Let $\alpha$ and $\beta$, $\alpha = c\beta$, be the two substrings assigned to the same class in the previous example. Now, suppose that in phase $i + 1$ we have to insert $\beta$ in the graph. In this case, $s_{i+1}$ is the last character of $\beta$, and we find it at the end of the edge entering node $p$, that is non-solid, since $\beta$ is not the representative of the class. Now we have two cases: $s_{i+1}$ was found at the end of an edge that entered node $p$ also at the previous extension, or we ended up somewhere else. In the former case, we had also inserted $\alpha$ at the previous extension of the same phase, therefore $\beta$ still belongs to the same class. In the latter, we have detected an occurrence of $\beta$ not preceded by $c$, that is, not as a suffix of $\alpha$, and we have to remove it from the class. To reflect this in the graph, we clone the node $p$ into a new node $q$, and redirect the non-solid edge to $q$ keeping the same label. The redirected edge becomes solid. An example is shown in Fig. 4. If some suffixes of $\beta$ had also been previously assigned to the same class as $\beta$, in the following extensions we will again find $s_{i+1}$ at the end of a non-solid edge entering $p$. These edges are redirected to $q$. It can be proved that it suffices to check only the last edge on each path to determine whether a class has to be split. No cloning takes place if a character is found at the end of an edge entering the final node.

The two observations outlined above can be implemented in the algorithm by
modifying Rules 2 and 3 accordingly. It is worth mentioning that both redirection
of edges to a newly created node and node cloning can take place during the
same phase. An example is shown in Fig. 5.

Fig. 5. From left to right, CDAWG for string abcabb at phase 6, extensions 5, 6,
and 7. Character b is put in the graph after substring ab, and the path spelling
b is found in the middle of non-solid edge (I, bcabb, F) (left), which is redirected
to node 1 (center). Then, at extension 7 (which adds b after the empty string), b is
found at the end of a non-solid edge. Node 1 is thus cloned into node 2 (right).



3.1 Using Suffix Links

Naively, locating the end of s[j..i] in extension j of phase i + 1 would take O(i − j)
time by walking from the initial node and matching the characters of s[j..i] along
the edges of the graph. This would lead to an overall O(n^3) time complexity for
the construction of the whole graph. We will now reduce it, as in [?], to O(n) by
introducing suffix links, together with a few additional remarks.

Definition 3. Let p be a node of the graph, different from the initial or final
node. Let β be the representative of the class associated with p. The suffix link
of p, denoted by L(p), is the node q whose representative γ is the longest suffix
of β whose path does not end at p.

The suffix link of a node p can be implemented with a pointer from p to L(p).
If γ is empty, then L(p) is the initial node. Suffix links are not defined for the
initial and the final node. Although the definition does not guarantee that every
node in the graph has a suffix link, we can prove the following:

Lemma 1. Any node created during phase i + 1 will have a suffix link from it
by the end of the phase.

Proof. In extension j of phase i + 1 a new node p can be created at the end of
the path spelling substring s[j..i] by application of Rule 2 or by cloning. In the
former case, L(p) will be the first node to be created or encountered at the end
of the path corresponding to a suffix of s[j..i] (possibly after edge redirections).
Such a node always exists, since the last extension locates the empty suffix at
the initial node. In the latter case, let us suppose that a node q is cloned into
node p with path spelling s[j..i + 1]. Substring s[j..i + 1] is the longest suffix of
the representative of q that does not belong to the same class. Thus, L(q) is set
to p. Suffix link L(p) is left undefined until one of the suffixes of s[j..i + 1] ends
at a node other than p (which again could be I). □

Fig. 6. A suffix link. Node p corresponds to class αβ, node q corresponds to β.
Paths labeled with suffixes of αβ longer than β end at p. If at some extension
j character s_{i+1} is added after αβγ, then extensions from j + 1 to j + |α| are
implicitly performed as well.



During any phase, the only node of the graph other than the initial and the
final without a suffix link from it is the last created one. Let us suppose that
the algorithm has completed extension j of phase i + 1. Suffix links are used to
speed up the search for the remaining suffixes of s[j..i]. Starting from the end of
s[j..i] in the graph, we walk backwards along the path corresponding to s[j..i]
up to either the initial node or a node p that has a suffix link. This requires
traversing at most one edge. Let γ be the concatenation of the edge labels of
the path from p to the end of s[j..i]. If p is not the initial node, we move to node
L(p) and follow from it the path spelling γ. Otherwise, we search for s[j + 1..i]
starting from I. Finally we add s_{i+1} according to one of the extension rules,
redirecting an edge or cloning a node if needed. Notice that, if node p is the end
of ℓ ≥ 2 different paths, the position reached after searching for γ from L(p) will
be the end of path s[j + ℓ..i], that is, extensions from j + 1 to j + ℓ − 1 have been
implicitly performed at extension j.

A path spelling γ starting from L(p) always exists, since all the suffixes of
s[j..i] are already in the graph. Thus, to find the path spelling γ the algorithm
just matches the first characters on the edges encountered. To obtain a linear
time algorithm, we need just two more "tricks".

Remark 1. When during any extension Rule 3 is applied, that is, a given
substring s[j..i + 1] is already in the graph, then the same rule will apply to all
further extensions, since all the suffixes of s[j..i + 1] are already in the graph as
well. Therefore, once Rule 3 is applied (and no node has to be cloned or edges
redirected), we can stop and move on to the next phase, since all the strings to
be inserted are already in the graph and no adjustment is needed for the classes.



Remark 2. If a new edge is created entering the final node during extension j
of any phase i, then Rule 1 will always apply at extension j in any successive
phase. That is, new characters will always be appended at the end of the last
edge in the path associated with s[j..i], which will enter the final node. Thus, when
a new edge is created entering the final node with label s[j..i + 1], we label it
with integers h and e (j ≤ h ≤ i + 1), where e denotes the current phase, that is,
the current end position in the string. If we implement e with a global variable,
and set it to i + 1 at the beginning of each phase i + 1, we perform implicitly all
the extensions that would end up at the final node.
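
The effect of the global counter e can be pictured with a minimal Python sketch. The names End, Edge and label are hypothetical and not taken from the paper, and positions are 0-based and half-open. The sketch only illustrates how a single increment extends every edge entering the final node; it is not the full construction.

```python
class End:
    """Shared, mutable end position (plays the role of the global counter e)."""
    def __init__(self, value=0):
        self.value = value


class Edge:
    """Edge whose label is s[start:end]; end is a fixed int or a shared End."""
    def __init__(self, start, end):
        self.start = start
        self.end = end

    def label(self, s):
        stop = self.end.value if isinstance(self.end, End) else self.end
        return s[self.start:stop]


s = "abcabb"
e = End(3)                                          # three characters read so far
open_edges = [Edge(0, e), Edge(1, e), Edge(2, e)]   # edges entering the final node
closed_edge = Edge(0, 2)                            # edge with a fixed label

before = [edge.label(s) for edge in open_edges]
e.value = 4                                         # next phase: one assignment
after = [edge.label(s) for edge in open_edges]
```

Every edge that shares the End object grows by one character after the single assignment, while the edge with a fixed integer end is unaffected.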



Every phase i starts with a series of applications of Rules 1 and 2 that put
s_i at the end of an edge entering the final node; when Rule 3 is applied for
the first time, it will also be applied to all further extensions. Now, let j_i be
the first extension where Rule 3 is applied with cloning in phase i, and j* the
first extension where it is applied without edge redirection to the cloned node.
Extensions j_i + 1 to j* − 1 will redirect edges to the last node created. Extensions
from j* + 1 to i need not be performed, since in each of them we would not
do anything. In phase i + 1, all extensions from 1 to j_i − 1 will apply Rule 1,
therefore they are implicitly performed by setting the counter e to i + 1. Thus,
we can start phase i + 1 directly from extension j* − 1, until we find an extension
where Rule 3 is applied without cloning or edge redirection. This can be done by
starting phase i + 1 from the position in the graph of the last suffix of s[1..i] that
had to be redirected to the cloned node. This took place at extension j* − 1. The
first extension in phase i + 1 will have to look for s_{i+1} exactly at the endpoint
of the last extension of phase i. This will also implicitly perform all extensions
from j_i to j* − 1. Of course, if in phase i Rule 3 is first applied without cloning
we can move on to phase i + 1 as well.

The algorithm does not need to know which extension it is currently performing.
That is, it starts phase i + 1 from the endpoint of phase i, adding s_{i+1}.
Then it starts moving in the graph by using suffix links, and adding s_{i+1} at the
end of each path. If the backward walk ends at I, and γ = γ_1 ... γ_k is the label
of the path traversed, then it looks for the path labeled γ_2 ... γ_k. Phase i + 1
ends when the algorithm applies Rule 3 for the first time without node cloning
or edge redirection. Moreover, whenever we find s_{i+1} at the end of a non-solid
edge, we no longer have to check what happened at the previous extension, and
just clone the node. In fact, if the representative of the class had been met during
one of the previous extensions, we would have stopped the phase at that point,
without reaching the current extension.

At the end of phase n, we have constructed the implicit CDAWG for string
s. In order to obtain the actual CDAWG, we perform an additional extension
phase n + 1, extending the string with a dummy symbol $ that does not belong to
the string alphabet. However, we do not increment the phase counter e to n + 1,
so as to avoid appending $ to edges entering the final node. Moreover, whenever
a new node p has to be created, we do not add the edge (p, $, F) to the graph.
Nodes created in this phase will thus have outdegree one, and will correspond to
terminal nodes of the CDAWG. Notice that, whenever a path s[j..n] ends along
an edge, we always create a new node and mark it as terminal, while cloning
of nodes and redirection of edges work as in the previous phases. When a path
s[j..n] ends at a node, we mark the node as terminal. At the end of the additional
phase, the implicit CDAWG has been transformed into the actual CDAWG for
string s. An example of the on-line construction of a CDAWG is shown in Fig. 7.

Fig. 7. From left to right, construction of the CDAWG for string abcabcbcd: at
the end of phase 6 (implicit CDAWG for string abcabc); at the end of phase 7
(abcabcb, where abc, bc, and c belong to the same strict class of factors); at the
end of phase 8 (abcabcbc, where bc and c have been removed from the class with
representative abc); the final structure. Stars indicate the position in the graph
reached at the end of the last explicit extension of each phase.



With arguments analogous to Ukkonen’s algorithm for suffix trees, we can prove 
the following: 

Theorem 1. Given a string s = s_1 ... s_n over a finite alphabet Σ, the algorithm
implemented with suffix links and implicit extensions builds the CDAWG for s
in O(n) time and O(n|Σ|) space if the graph is implemented with a transition
matrix, or in O(n|Σ|) time and O(n) space with adjacency lists.

Proof (Sketch). The operations performed in any explicit extension (creation or
cloning of nodes, edge redirections), that is, in extensions that are not performed
implicitly by incrementing the e counter, take constant time. Let j* be the last
explicit extension performed at phase i, and j_{i+1} the first explicit extension
performed at phase i + 1. In the worst case, we have j_{i+1} = j* − 1. Moreover,
for each i, j_i ≤ j_{i+1}. Thus, at most 3n explicit extensions are performed by the
algorithm. At any extension j of phase i, to locate the endpoint of s[j..i] the
algorithm walks back at most one edge from the endpoint of s[j − 1..i], follows
a suffix link, and then traverses some edges checking the first symbol on each
edge. If the graph is implemented with a transition matrix, traversing an edge
takes constant time. Otherwise, it takes O(|Σ|) time. The only thing unaccounted
for is the overall number of edges traversed. For every node p of the graph, let the
node depth of p be the number of nodes on the path from the root to p labeled
with the representative of the class associated with p. As in [?], the sum of the
node depths counted during all the explicit extensions is reduced at most by
O(n), and since the maximum node depth is n, the maximum number of edges
traversed is bounded by O(n). □

4 The CDAWG for a Set of Strings

The basic idea of the CDAWG for a set of strings S = {s^1, ..., s^k} is the same
as that of the single string structure. Now, the nodes of the structure correspond
to patterns that occur as prefixes of the same suffixes in every string of the set. In
other words, given Suf(S) (the set of the suffixes of the k strings), the nodes
of the CDAWG correspond to strict classes of factors for ≡_{Suf(S)}. The only
difference is that now we have k final nodes F_1, ..., F_k, one for each string, and
we want all the suffixes of s^i to end at the corresponding final node F_i. This result
can be obtained by appending a different termination symbol, not belonging to
the string alphabet, to each string of the set. More formally:

Definition 4. The CDAWG for a set of strings s^1, ..., s^k is a directed acyclic
graph, with a node marked as initial and k distinct nodes F_1, ..., F_k marked as
final. Edges are labeled with nonempty substrings of at least one of the strings.
Labels of two edges leaving the same node cannot begin with the same character.
For every string s^i in the set, all suffixes of s^i are spelled by paths starting
at the initial node and ending at node F_i. Paths ending at non-final nodes
correspond to strict classes of factors of the congruence relation ≡_{Suf(S)}.

The CDAWG for a set of strings can be constructed with the algorithm
presented in the previous section. First, we build the CDAWG for string s^1 (with
the termination symbol) and final node F_1. Notice that, since the termination
symbol does not occur anywhere else in s^1, the resulting structure is a CDAWG,
with no need to perform the additional phase. Then, string s^2 is added to the
graph, but in this case with final node F_2. The same will apply to every other
string in the set. Node cloning and edge redirection rules ensure the correctness
of the resulting structure. It can be proved that the algorithm takes O(N) time
to construct the structure, implemented with a transition matrix, where N =
Σ_{i=1}^{k} |s^i|. This structure (with marginal differences) was first described in [?],
where it was built by reducing a DAWG. Therefore, adding a new string to
the set required the construction of a new DAWG from scratch. The algorithm
presented here, instead, allows strings to be added directly to the compact
structure (see Fig. 8). As in [?], we can give an upper bound on the size of the
structure.

Theorem 2 (Blumer et al. [?]). The CDAWG for a set of strings s^1, ..., s^k
has at most N + k nodes, where N = Σ_{i=1}^{k} |s^i|.

5 Conclusions 

A CDAWG is a space-efficient text-indexing structure that represents all the
substrings of a string. We presented a new on-line algorithm for its construction,
as well as for the construction of a CDAWG for a set of strings. The same
structures can be computed by reduction starting from the corresponding DAWGs
or suffix trees; however, the approach presented in this paper saves both time and
space, since the CDAWGs can be built directly. Moreover, once
the structure has been built for a set of strings, new strings can be added directly
to the compact structure.

Fig. 8. CDAWG for strings ababc$_1 and abcab$_2, after the insertion of ababc$_1
(left) and abcab$_2 (right). Characters $_1 and $_2 are used as terminations. Edges
(I, $_1, F_1) and (I, $_2, F_2) have been omitted.

Abstract. We present a linear-time algorithm to compute the longest
common prefix information in suffix arrays. As two applications of our
algorithm, we show that our algorithm is crucial to the effective use of
block-sorting compression, and we present a linear-time algorithm to simulate
the bottom-up traversal of a suffix tree with a suffix array combined
with the longest common prefix information.



1 Introduction 

The suffix array [?] is a space-efficient data structure that allows efficient
searching of a text for any given pattern. The suffix array is basically a sorted array
Pos of all the suffixes of a text. A suffix array for a text of length n can be
built in O(n log n) time, and searching the text for a pattern of length m can
be done in O(m log n) time by a binary search. When a suffix array is coupled
with information about the longest common prefixes (lcps) of some elements in
the suffix array, string searches can be sped up to O(m + log n) time. The lcp
information is usually computed during the construction of suffix arrays.
In some cases, however, the lcp information may not be readily available.
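
As a rough illustration of the plain O(m log n) search, the following Python sketch builds Pos by ordinary comparison sorting (so it does not achieve the O(n log n) construction bound) and locates a pattern with two binary searches. The function names are hypothetical, and positions are 0-based rather than 1-based as in the text.

```python
def build_suffix_array(text):
    """Return Pos: start positions (0-based) of the suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, pos, pattern):
    """Return the sorted start positions of pattern, found by binary search."""
    m = len(pattern)
    lo, hi = 0, len(pos)
    while lo < hi:                      # first suffix whose m-prefix >= pattern
        mid = (lo + hi) // 2
        if text[pos[mid]:pos[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    left = lo
    hi = len(pos)
    while lo < hi:                      # first suffix whose m-prefix > pattern
        mid = (lo + hi) // 2
        if text[pos[mid]:pos[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(pos[left:lo])


text = "abcabbca$"
pos = build_suffix_array(text)
hits = find_occurrences(text, pos, "bc")
```

Each of the O(log n) comparisons inspects at most m characters, which gives the O(m log n) bound; the lcp information is what removes the factor m from all but a constant number of comparisons.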

In this paper we consider the lcp problem in suffix arrays, that is, to compute
the lcp information from a text and its Pos array, and present a linear-time
algorithm for the problem. We also describe two applications of our algorithm,
namely block-sorting compression and the substring traversal problem.

The block-sorting algorithm [?] is a text compression method with a good
balance of compression ratio and speed. The original text can be decoded in linear
time from block-sorting compression. An advantage of block-sorting compression
is that the suffix array of the original text can also be obtained in the process of

* This work was supported by the Brain Korea 21 Project.

Fig. 1. An example of the suffix tree and the suffix array.



decoding. This means that we can compress a text and its suffix array together
by simply using the block-sorting algorithm. This fact can be used for storing
and transferring large full-text databases. However, the lcp information that is
necessary for efficient searching is not obtained during the decoding of
block-sorting compression. With our algorithm, block-sorting compression can be used
more effectively to store a text and its suffix array.

The substring traversal problem is to enumerate all branching substrings
appearing in a given text. Although the problem is easily solvable by a bottom-up
traversal of the suffix tree, recent large-scale applications in bioinformatics and
data mining require a more practical and scalable solution for the problem [?].

We present a simple linear-time algorithm that simulates the bottom-up
traversal of a suffix tree with a suffix array combined with the lcp information.
Our algorithm is space-efficient and I/O-efficient, i.e., it requires only 7n bytes
including the text while the suffix tree requires at least 15n bytes, and it has a
good I/O complexity of 6n/B blocks. Furthermore, the algorithm can be modified
to solve a class of problems based on the occurrence count of each branching
substring, which includes the longest common substring problem, the
square/tandem repeat problem, and the frequent/optimal substring problem [?].
Experiments on English text data show that our proposed algorithms
run efficiently in practice.



2 Preliminaries 

Let A = a_1 a_2 ... a_{n−1}$ be a text of length n > 1. In what follows, we assume
that A ends with a special end marker $ that does not appear in other positions.
Let A_i denote the suffix of A that starts at position i. For a substring S of
A, we denote by Occ(S, A) the set of all occurrences of S in A. Let ≡_A be an
equivalence relation on substrings defined as follows: For any substrings S, S',
the relation S ≡_A S' holds if and only if Occ(S, A) = Occ(S', A). A substring S
of A is branching if S is the longest common prefix of distinct suffixes A_i and
A_j.

The suffix array of a text A is a sorted array Pos[1..n] of all the suffixes
of A, i.e., Pos[k] = i if A_i is lexicographically the k-th suffix. The suffix tree
[?] is a data structure for storing all branching substrings of A, which is the
compacted trie ST for all suffixes of A. The suffix tree has at most 2n − 1 nodes
and can be stored in O(n) space. The suffix tree ST of a text of length n can
be constructed in O(n) time. The suffix array Pos of A coincides with
the list of the leaves of ST ordered from left to right. In Fig. 1 we show the
suffix tree and the suffix array of string A = abcabbca$. We denote by str(v)
the substring of A obtained by concatenating the labels on the path from the
root to v. The following lemma is well known.

Lemma 1. Let S be any substring of A. Then, the following 1-3 are equivalent.

1. S is branching.

2. S is the unique longest member of the equivalence class of S w.r.t. ≡_A.

3. S = str(v) for some internal node v of the suffix tree of A.

We denote by lcp(A, B) the length of the longest common prefix between
strings A and B. The lcps between suffixes that are adjacent in the sorted Pos
array are denoted by an array Height: Height[k] = lcp(A_{Pos[k−1]}, A_{Pos[k]}) for
2 ≤ k ≤ n. All the necessary lcps for O(m + log n) search (called arrays Llcp and
Rlcp in [?]) can be computed easily in O(n) time from array Height.
Therefore, we define the lcp problem as follows.

Definition 1. The lcp problem in suffix arrays is to compute the Height array
from a text A and its suffix array Pos.

For any substring S of A, the suffix array Pos gives a compact representation
of all occurrences of S. The set of all occurrences of S occupies a contiguous
interval [L, R] ⊆ {1, ..., n}, namely, Occ(S, A) = {Pos[k] : L ≤ k ≤ R}. We
call the pair (L, R) the rank interval of S. Then, the triple for S is the triple
(L, R, H) of integers, where (L, R) is the rank interval of S and H = |S| is
the length of S. If necessary, the substring can be immediately obtained by
S = A[Pos[L]..Pos[L] + H − 1].
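
The triple representation can be checked on the running example with a small Python sketch. The helper names triple and recover are hypothetical; ranks are 1-based as in the text, while text positions are 0-based, and the rank interval is found by a linear scan purely for clarity.

```python
def triple(text, pos, s):
    """Return (L, R, H) for substring s with 1-based ranks, or None if s is absent."""
    ranks = [k for k, i in enumerate(pos, start=1) if text.startswith(s, i)]
    if not ranks:
        return None
    left, right = ranks[0], ranks[-1]
    # The occurrences of s occupy a contiguous interval of ranks.
    assert ranks == list(range(left, right + 1))
    return (left, right, len(s))


def recover(text, pos, t):
    """Rebuild the substring from its triple: A[Pos[L]..Pos[L] + H - 1]."""
    left, _, h = t
    start = pos[left - 1]
    return text[start:start + h]


text = "abcabbca$"
pos = sorted(range(len(text)), key=lambda i: text[i:])
t_bc = triple(text, pos, "bc")
t_a = triple(text, pos, "a")
```

Three integers are enough to stand for a substring and all of its occurrences, which is why the traversal can report triples instead of explicit strings.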

A bottom-up traversal of the suffix tree is any list ℒ of its nodes such that
each node appears exactly once in ℒ, and a node appears in ℒ only after all
of its children appear. The post-order traversal [?] is an example of bottom-up
traversals. A bottom-up substring traversal of A is a list ℒ of the triples (L, R, H)
for all branching substrings of A which is generated by a bottom-up traversal of
the suffix tree of A. Then, the substring traversal problem is stated as follows.

Definition 2. The substring traversal problem is to compute the substring
traversal ℒ for a text A.

This problem is linear time solvable by a post-order traversal of the suffix 
tree ST. Unfortunately, it is difficult to solve this problem with the suffix array 
Pos alone because Pos has lost the information on tree topology. The array 
Height has the information on the tree topology which is lost in the suffix array 
Pos.

Fig. 2. An example of sorted suffixes and lcps.



Lemma 2. A substring S of a text A is branching if and only if there exists
some rank 1 < k ≤ n such that S is the longest common prefix of the adjacent
suffixes A_{Pos[k−1]} and A_{Pos[k]}.

From Lemma 2, we can compute the list of all branching substrings associated
with the in-order traversal of ST simply by reporting A[Pos[k]..Pos[k] +
Height[k] − 1] for every rank 1 < k ≤ n. Unfortunately, the obtained list may
contain duplicates since ST is not a binary tree. Furthermore, there is no obvious
way to compute either the associated rank intervals (L, R) or the post-order
traversal.

3 Linear-Time lcp Computation

In the lcp computation, we will use an intermediate array Rank. The array Rank
is defined as the inverse function of Pos, and it can be obtained immediately
when the Pos array is given: If Pos[k] = i, then Rank[i] = k.



3.1 Properties of lcp

The lcp between two suffixes is the minimum of the lcps of all pairs of adjacent
suffixes between them on the Pos array [?]. That is,

lcp(A_{Pos[x]}, A_{Pos[z]}) = min_{x < y ≤ z} { lcp(A_{Pos[y−1]}, A_{Pos[y]}) }.

This implies that the lcp of a pair of adjacent suffixes on Pos is greater than or
equal to the lcp of a pair of suffixes that surround them.

Fact 1. lcp(A_{Pos[y−1]}, A_{Pos[y]}) ≥ lcp(A_{Pos[x]}, A_{Pos[z]}) for x < y ≤ z.
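
The minimum property can be verified by brute force on a small text. The Python sketch below is only a sanity check under the assumption of 0-based ranks; the helper names are hypothetical, and height[k] plays the role of Height for the suffixes of rank k − 1 and k.

```python
def lcp(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


text = "abcabbca$"
pos = sorted(range(len(text)), key=lambda i: text[i:])
suffixes = [text[i:] for i in pos]
n = len(pos)

# height[k] = lcp of the suffixes of rank k-1 and k; height[0] is unused.
height = [0] + [lcp(suffixes[k - 1], suffixes[k]) for k in range(1, n)]


def lcp_by_min(x, z):
    """lcp of the suffixes of rank x < z as a minimum over adjacent pairs."""
    return min(height[x + 1:z + 1])


checks = [lcp(suffixes[x], suffixes[z]) == lcp_by_min(x, z)
          for x in range(n) for z in range(x + 1, n)]
```

Fact 1 follows because each adjacent lcp in the range is one of the values over which the minimum is taken.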

When the lcp between a pair of adjacent suffixes on Pos is greater than 1,
the lexicographical order of the suffixes is preserved when the first character of
each suffix is deleted.

Fact 2. If lcp(A_{Pos[x−1]}, A_{Pos[x]}) > 1, then

Rank[Pos[x − 1] + 1] < Rank[Pos[x] + 1].

In this case, the lcp between A_{Pos[x−1]+1} and A_{Pos[x]+1} is one less than the
lcp between A_{Pos[x−1]} and A_{Pos[x]}.

Fact 3. If lcp(A_{Pos[x−1]}, A_{Pos[x]}) > 1, then

lcp(A_{Pos[x−1]+1}, A_{Pos[x]+1}) = lcp(A_{Pos[x−1]}, A_{Pos[x]}) − 1.

Now we consider the following problem: compute the lcp between a suffix A_i
and its adjacent suffix on Pos when the lcp between A_{i−1} and its adjacent suffix
is known. For notational convenience, let p = Rank[i − 1] and q = Rank[i]. Also
let j − 1 = Pos[p − 1] and k = Pos[q − 1]. See Fig. 2. That is, we want to compute
Height[q] when Height[p] is given.

Lemma 3. If lcp(A_{j−1}, A_{i−1}) > 1 then lcp(A_k, A_i) ≥ lcp(A_j, A_i).

Proof. Since lcp(A_{j−1}, A_{i−1}) > 1, we have Rank[j] < Rank[i] by Fact 2. Since
Rank[j] ≤ Rank[k] = Rank[i] − 1, we get lcp(A_k, A_i) ≥ lcp(A_j, A_i) by Fact 1.

Theorem 1. If Height[p] = lcp(A_{j−1}, A_{i−1}) > 1 then

Height[q] = lcp(A_k, A_i) ≥ Height[p] − 1.

Proof.

lcp(A_k, A_i) ≥ lcp(A_j, A_i)               (by Lemma 3)
              = lcp(A_{j−1}, A_{i−1}) − 1.    (by Fact 3)

By Theorem 1, when the lcp between suffix A_{i−1} and its adjacent suffix is h,
suffix A_i and its adjacent suffix on Pos have a common prefix of length at least
h − 1. Therefore, it suffices to compare from the h-th characters for computing
the lcp between suffix A_i and its adjacent suffix. If h is less than or equal to 1,
we will compare from the first characters.

3.2 Algorithm and Analysis

We now present the algorithm GetHeight that solves the lcp problem in suffix
arrays. By Theorem 1, we do not need to compare all characters when we compute
the lcp between a suffix and its adjacent suffix on Pos. To compute all the
lcps of adjacent suffixes on Pos efficiently, we examine the suffixes from A_1 to
A_n in order.

Theorem 2. Algorithm GetHeight computes array Height in O(n) time.

Algorithm GetHeight
input: A text A and its suffix array Pos
 1 for i := 1 to n do
 2     Rank[Pos[i]] := i
 3 od
 4 h := 0
 5 for i := 1 to n do
 6     if Rank[i] > 1 then
 7         k := Pos[Rank[i]-1]
 8         while A[i+h] = A[k+h] do
 9             h := h+1
10         od
11         Height[Rank[i]] := h
12         if h > 0 then h := h-1 fi
13     fi
14 od

Fig. 3. The linear-time algorithm for the lcp problem.

Proof. The correctness of GetHeight follows from previous discussions. The
execution time of the algorithm is proportional to the number of times line 9 is
executed, since line 9 is the innermost loop of GetHeight. The value of h
increases one by one in line 9, and it is always less than n due to the end marker
$. Since the initial value of h is 0 and it decreases at most n times in line
12, h increases at most 2n times. Therefore, the time complexity of Algorithm
GetHeight is O(n).
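
A direct Python transcription of GetHeight may help in following the analysis. It is a sketch under two assumptions: indices and ranks are 0-based rather than 1-based as in Fig. 3, and the text ends with a unique end marker so that the character comparisons stop before running off the end. The function name get_height is illustrative.

```python
def get_height(text, pos):
    """Return height, where height[r] is the lcp of the suffixes of rank r-1 and r.

    Ranks are 0-based and height[0] is 0 by convention. The text must end with
    a character that occurs nowhere else.
    """
    n = len(text)
    rank = [0] * n
    for r, i in enumerate(pos):         # Rank is the inverse of Pos
        rank[i] = r
    height = [0] * n
    h = 0
    for i in range(n):                  # suffixes in text order
        if rank[i] > 0:
            k = pos[rank[i] - 1]        # suffix just before A_i in Pos
            while text[i + h] == text[k + h]:
                h += 1
            height[rank[i]] = h
            if h > 0:
                h -= 1                  # Theorem 1: the next lcp is at least h - 1
    return height


text = "abcabbca$"
pos = sorted(range(len(text)), key=lambda i: text[i:])
height = get_height(text, pos)
```

The variable h is never reset inside the loop; it only drops by one per iteration, which is what bounds the total number of character comparisons.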



4 Application to Block-Sorting Compression 

4.1 Block-Sorting Compression 

The block-sorting algorithm is a text compression method with a good balance of
compression ratio and speed. It achieves speed comparable to dictionary
compressors, but obtains compression close to the best statistical compressors.
The block-sorting algorithm is used in bzip2.

The encoder of block-sorting consists of three processes: the Burrows-Wheeler
transformation, move-to-front encoding and entropy coding. The Burrows-Wheeler
transformation (BWT) is the most time-consuming process. It transforms
a string A of length n by forming the n rotations (cyclic shifts) of A,
sorting them lexicographically, and extracting the last character of each of the
rotations. A string L is formed from these characters, where the i-th character
of L is the last character of the i-th sorted rotation. In addition to L, the BWT
computes the index I of the original string A in the sorted list of rotations. Fig. 4
is an example of BWT where A = 'abraca'. A move-to-front encoding encodes an
instance of a character ch by the count of distinct characters between itself and
the previous occurrence of ch.

Fig. 4. An example of the Burrows-Wheeler transformation.

As a result of BWT, the locality of characters of L is higher than that of A.
So, when applied to the string L, the output of a move-to-front encoder will
be dominated by low numbers, which can be effectively encoded with Huffman
coding or run-length coding.
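
The first two encoder stages can be sketched in a few lines of Python. This is a naive illustration that materialises all rotations, not the suffix-sorting implementation used in practice. The function names are hypothetical, the index I is 0-based here, and the initial recency list for move-to-front is assumed to be the sorted alphabet.

```python
def bwt(text):
    """Return (L, I): the last column of the sorted rotations and the
    0-based index of the original string among them."""
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    last = "".join(rot[-1] for rot in rotations)
    return last, rotations.index(text)


def move_to_front(data, alphabet):
    """Encode each character by its position in a recency list,
    then move that character to the front of the list."""
    recency = list(alphabet)
    out = []
    for ch in data:
        idx = recency.index(ch)
        out.append(idx)
        recency.insert(0, recency.pop(idx))
    return out


last, index = bwt("abraca")
codes = move_to_front(last, "abcr")
```

For longer, more repetitive inputs the move-to-front output contains long runs of small values, which is what the entropy coder exploits.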

The decoder of block-sorting is the reverse of the encoder. Decoding speed of
an entropy code depends on the method used, but Huffman coding or run-length
coding, which is generally used for encoding, can be reversed in linear time. A
move-to-front code can be reversed in O(n) time, and the original string A, the
reverse of the BWT, can be recovered from L and I in O(n) time. Therefore,
the block-sorting decompression takes linear time in general.



4.2 Block-Sorting and Suffix Arrays

The first step of block-sorting, the BWT, is similar to the construction process of
a suffix array. The BWT takes much time for sorting the suffixes. However, its
reverse transformation from L and I to A is quickly computed in linear time by a
radix-sort-like procedure. Moreover, the suffix array Pos of A can be computed
immediately when the compressed text is decoded.
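
One common way to realise the radix-sort-like reverse transformation is a counting pass over L followed by n pointer-following steps. The Python sketch below assumes a 0-based index I; the function name inverse_bwt and the table name lf are illustrative, and sorting the distinct characters adds a cost that depends only on the alphabet size.

```python
def inverse_bwt(last, index):
    """Recover the original string from the BWT output (L, I), I being 0-based."""
    n = len(last)
    counts = {}
    for ch in last:
        counts[ch] = counts.get(ch, 0) + 1
    # start[ch] = row of the first rotation that begins with ch
    start, total = {}, 0
    for ch in sorted(counts):
        start[ch] = total
        total += counts[ch]
    # lf[k] = row of the rotation obtained by moving last[k] to the front
    lf = [0] * n
    seen = {}
    for k, ch in enumerate(last):
        lf[k] = start[ch] + seen.get(ch, 0)
        seen[ch] = seen.get(ch, 0) + 1
    out = []
    row = index
    for _ in range(n):                  # emits the text from its last character
        out.append(last[row])
        row = lf[row]
    return "".join(reversed(out))


decoded = inverse_bwt("caraab", 1)
```

The sequence of rows visited by the loop is the order information from which the decoder can also read off a suffix array.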

To search for a pattern using the suffix array more efficiently, the lcp
information (Llcp and Rlcp) is required. The lcp information can be computed in
O(n log n) time when the suffix array is constructed from the original text A.
With our algorithm, the lcp information can be computed in O(n) time from
the original text A and its Pos array. Therefore, suffix arrays can be stored and
used efficiently by the block-sorting compression.

Since block-sorting has the effect of storing the compressed text and its suffix
array, it can be used for storing and transferring large data. Sadakane and Imai
presented a cooperative distributed text database management method unifying
search and compression based on BWT [?]. Sadakane also presented a modified
BWT for case-insensitive search with the suffix array [?]. Recently, Sadakane
proposed a compressed text database system [?] based on the compressed suffix
array [?]. Ferragina and Manzini proposed a data structure that supports
search operations without uncompressing the block-sorting compression.

Algorithm BottomUpTraverse;
input: An ordered and compacted tree T with n > 0 leaves ℓ_1, ..., ℓ_n.
1 S := {⊤}                                /* Initialize the stack S */
2 for k := 1 to n+1 do                    /* k-th stage */
3     v := lca(ℓ_{k−1}, ℓ_k);
4     while (depth(top(S)) > depth(v)) do
5         u := Pop(S) and report u; od;
6     if (depth(top(S)) < depth(v)) then
7         Push(v, S); fi;
8     Push(ℓ_k, S);                       /* Set S_k = S */
9 od                                      /* for-loop */

Fig. 5. An example of the rightmost branch decompositions.

Fig. 6. The algorithm to compute the post-order traversal of an ordered tree.



5 Bottom-Up Traversal of Suffix Trees

5.1 Properties of the Post-Order Traversal 

An ordered tree T is compacted if every internal node of T has at least two
children. Let T be an ordered and compacted tree with n > 0 leaves ℓ_1, ..., ℓ_n.
In what follows, a path in T is always written in the upward direction. That is,
a path (or upward path) is a sequence π = (v_0, v_1, ..., v_m) (m ≥ 0) of nodes
in T such that v_i is the parent of v_{i−1} for every 1 ≤ i ≤ m. The length of π
is |π| = m. A path π from the k-th leaf (1 ≤ k ≤ n) to the root is called the
k-th branch of T and denoted by π(ℓ_k). The node-depth of a node v, denoted by
depth(v), is the length of the path from v to the root. We write u ⪯ v (u ≺ v) if
a node u is an ancestor (proper ancestor) of node v. We denote by lca(u, v) the
lowest common ancestor of nodes u and v and by π(ℓ) the branch starting at
a leaf ℓ. Let ℓ be any leaf. A rightmost branch (RM branch, for short) starting
with ℓ, denoted by Π(ℓ), is the longest path π = (v_0 = ℓ, v_1, ..., v_m) (m ≥ 0)
starting at ℓ that consists of only rightmost edges, that is, v_{i−1} is the rightmost
child of v_i for every 1 ≤ i ≤ m.

Π(ℓ_k) is called the k-th RM branch. The set {Π(ℓ_1), ..., Π(ℓ_n)} of
all RM branches of T is called the RM branch decomposition of T, since it is a
partition of T. Fig. 5 shows an example of the RM branch decompositions, where
each shadowed line indicates an RM branch (see below for the special node ⊤).

Lemma 4. The post-order traversal of an ordered tree T equals the concatenation
Π(ℓ_1) ··· Π(ℓ_n) of the RM branches of T from left to right.



5.2 Algorithm for Bottom-Up Substring Traversal 

From now on, we consider a method to compute the post-order traversal of an
ordered compacted tree $T$ with $n > 0$ leaves $\ell_1, \ldots, \ell_n$ when the lowest
common ancestors of adjacent leaves and the depth of a node are available. Fig. 6
shows the algorithm BottomUpTraverse for the problem. In the algorithm, we assume
a special top node $\top$ such that $\top < v$ for every $v$ in $T$ and special leaves
$\ell_0$ and $\ell_{n+1}$ such that $lca(\ell_0, \ell_1) = lca(\ell_n, \ell_{n+1}) = \top$
(see Fig. 5).

Scanning the height array Height from left to right, the algorithm enumerates
the nodes of $T$ without duplicates by a sequence of push/pop operations to a
stack $S$ as follows. During the scan, a leaf node, say $\ell_k$, is pushed onto the
stack $S$ when it is first encountered at stage $k$ and popped immediately at
stage $k+1$. The case for internal nodes is more complicated (see Fig. 5).
Conceptually, a node $v$ is pushed when it is visited from below for the first
time and popped when it is visited for the last time in the depth-first search
of $T$.

An internal node $v$ is pushed onto the stack when the leftmost leaf of the
second child of $v$, say $\ell_k$, is first encountered in the scan, i.e.,
$v = lca(\ell_{k-1}, \ell_k)$. Then, $v$ is popped from the stack $S$ when the leftmost
leaf of the next right sibling of $v$, say $\ell_{k'}$, is encountered in the scan;
at that point $p = lca(\ell_{k'-1}, \ell_{k'})$ is the parent of $v$. Since the tree is
compacted, the second leftmost leaf always exists for every internal node. Thus
from Lemma 4 the algorithm BottomUpTraverse enumerates all nodes without
duplicates by a scan of Height.

To see that the algorithm correctly computes the post-order traversal of $T$, we
need to know the precise contents of the stack during the scan. A key observation
is that if an internal node $v$ is $lca(\ell_{k-1}, \ell_k)$ for some $k$ then $v$ is on
the $k$-th branch from $\ell_k$ to the root and all nodes of $\Pi(\ell_{k-1})$ are proper
descendants of $v$. We give the following lemma without proof due to space
limitations (see for the complete proof).

Lemma 5. Let us consider the algorithm BottomUpTraverse of Fig. 6. For any
stage $1 \le k \le n + 1$, the contents of the stack $S$ at the beginning of the
$k$-th stage is the subsequence $S_k = (v_{j_0}, \ldots, v_{j_p})$ of the $k$-th branch
$\pi_k = (v_0 = \ell_k, v_1, \ldots, v_m = \top)$ ($m \ge 0$) such that for every
$0 \le j \le m$, $v_j \in S_k$ if and only if the following inclusion condition holds
at position $j$: either (i) $j = 0$ or (ii) $v_{j-1}$ is not the leftmost child of
$v_j$.

From Lemma 5 we see that at the end of every stage $k-1$, the $(k-1)$-th RM
branch $\Pi(\ell_{k-1})$ is stored on the top of the stack $S$. Then, $\Pi(\ell_{k-1})$ is
deleted from the stack $S$ when $\ell_k$ is encountered in the scan. By repeating
this process, the algorithm finally outputs all RM branches
$\Pi(\ell_1), \ldots, \Pi(\ell_n)$ of $T$ from left to right. Hence, the next lemma
immediately follows from Lemma 4.

Lemma 6. The algorithm BottomUpTraverse of Fig. 6 computes the post-order
traversal of an ordered compacted tree with $n$ leaves in $O(n)$ time when the
node-depth of a node and the lowest common ancestor of adjacent leaves are
computable in constant time.

Now we present a linear time algorithm for the substring traversal problem
when the height array and the suffix array of A are given. Fig. 7 shows the
algorithm TraverseWithArray to compute the list of triples for text A generated
by the post-order traversal of a suffix tree. In the algorithm, we encode a node
v by any pair (L, H) such that L is any occurrence and H is the length of the
substring str(v). The top node is encoded by (-1, -1).

Algorithm TraverseWithArray;
input: The height array Height and the suffix array Pos for a text A;
 1  S := (-1, -1); n := |A|;  /* Initialize the stack S */
 2  for k := 1 to n+1 do  /* k-th stage */
 3    (Llca, Hlca) := (k-1, Height[k]);
 4    (L, H) := top(S);
 5    while (H > Hlca) do
 6      (L, H) := pop(S); R := k-1; report triple (L, R, H);
 7      Llca := L;  /* Update the left boundary */
 8      (L, H) := top(S);
 9    od
10    if (H < Hlca) then
11      Push((Llca, Hlca), S); fi;
12    Push((k, n - Pos[k] + 1), S);  /* Set S_k = S */
13  od  /* for-loop */

Fig. 7. A linear time algorithm for the substring traversal problem.

Recall that there were only two types of nodes processed in the algorithm
BottomUpTraverse, a leaf and the lca of adjacent leaves. Thus for any rank
$1 \le k \le n + 1$, we encode $v$ by $(L, H)$ as follows: (i) if $v$ is the leaf
$\ell_k$ then $(L, H) = (k, |A_{Pos[k]}|) = (k, n - Pos[k] + 1)$ and (ii) if $v$ is the
lca node $lca(\ell_{k-1}, \ell_k)$ then $(L, H) = (k - 1, Height[k])$. The depth of the
node $v$ is given by $H$. From the preceding lemmas we know that the algorithm
correctly simulates BottomUpTraverse.

We then consider the computation of the rank intervals. Suppose a pair
$(L, H)$ is popped from the stack $S$ at stage $k$ and it represents a node $v$. By
induction on the number of nodes below $v$ on the $(k-1)$-th path, we can show
that $L$ is the rank of the leftmost leaf of $v$, where the value of $L$ is kept in
the variable Llca at Line 7 of the algorithm. Since $v$ is on the $(k-1)$-th RM
branch, $R = k - 1$ is the rank of the rightmost leaf of $v$. Therefore, $(L, R, H)$
is the triple of $v$, and the next theorem follows from Lemma 6.

Theorem. The algorithm TraverseWithArray of Fig. 7 computes in $O(n)$
time the list of all triples generated by the post-order traversal of the suffix
tree of a text A of length $n$ when the height array and the suffix array of A
are given.
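
The algorithm of Fig. 7 can be transcribed almost line by line. The Python sketch below is an illustration under stated assumptions, not the authors' implementation: the suffix array and height array are built naively, the text is assumed to end with a unique smallest terminator `$`, the boundary values Height[1] = Height[n+1] = -1 are chosen here so that they match the top node (-1, -1), and the leaf push of Line 12 is skipped at stage n+1 because Pos[n+1] does not exist.

```python
def lcp(a, b):
    """Length of the longest common prefix of strings a and b."""
    i = 0
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1
    return i


def suffix_array_and_height(text):
    """Naive construction; arrays are 1-based (index 0 is unused)."""
    n = len(text)
    order = sorted(range(n), key=lambda i: text[i:])
    pos = [0] + [i + 1 for i in order]        # pos[k]: start of rank-k suffix
    height = [0] * (n + 2)
    height[1] = height[n + 1] = -1            # assumed boundary values
    for k in range(2, n + 1):
        height[k] = lcp(text[pos[k - 1] - 1:], text[pos[k] - 1:])
    return pos, height


def traverse_with_array(pos, height):
    """Report triples (L, R, H) in post-order, following Fig. 7."""
    n = len(pos) - 1
    stack = [(-1, -1)]                        # the top node
    triples = []
    for k in range(1, n + 2):                 # k-th stage
        l_lca, h_lca = k - 1, height[k]
        L, H = stack[-1]
        while H > h_lca:
            L, H = stack.pop()
            triples.append((L, k - 1, H))     # R = k - 1
            l_lca = L                         # update the left boundary
            L, H = stack[-1]
        if H < h_lca:
            stack.append((l_lca, h_lca))      # lca of leaves k-1 and k
        if k <= n:                            # assumption: no leaf n+1
            stack.append((k, n - pos[k] + 1))
    return triples


pos, height = suffix_array_and_height("abab$")
triples = traverse_with_array(pos, height)
print(triples)
```

For the text `abab$` the internal nodes reported are (2, 3, 2) for the substring `ab`, (4, 5, 1) for `b`, and (1, 5, 0) for the root; every other triple is a leaf with L = R.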

Hence, the substring traversal problem is solvable in linear time when the
height array Height of a text A is given. Since the algorithm TraverseWithArray
makes only sequential I/Os and does not access the text A, we can also see that
the algorithm is I/O efficient in the external I/O model.

6 Experimental Results 

We ran experiments on a real dataset. For the height array construction, we
implemented the naive $O(n^2)$ time algorithm (abbreviated as NaiveHeight)
and the linear time algorithm GetHeight (GetHeight). For the bottom-up
substring traversal in Section 5, we implemented the algorithm with the suffix
tree (TravTree), the naive algorithm with binary search on the suffix array
(TravBinary), and the algorithm TraverseWithArray (TravHeight).

Table 1. Comparison of the computation time on English texts.

              Height array construction   Substring traversal
Algorithm     NaiveHeight   GetHeight     TravTree   TravBinary   TravHeight
Time (sec)    17.59         7M            2.07       13.62        1^

In Table 1 we show the running time of the algorithms on an English text
of 5.3MB on a workstation (Sun UltraSPARC 300MHz, 256MB, g++ on
Solaris 2.6). In the substring traversal, the preprocessing time for building the
height array is not included. For the height array construction, we see from
this table that GetHeight is more than twice as fast as NaiveHeight on this
test data. For the substring traversal, TravHeight is as fast as TravTree when
the height array is precomputed, and faster than TravBinary even when the
computation time of the height array is included.

Abstract. Compressed pattern matching is one of the most active topics
in string matching. The goal is to find all occurrences of a pattern
in a compressed text without decompression. Various algorithms have 
been proposed depending on underlying compression methods in the 
last decade. Although some algorithms for multipattern searching on 
compressed text were also presented very recently, all of them are only 
for Lempel-Ziv family compressions. In this paper we propose two types 
of multipattern matching algorithms on collage system, which simulate 
the AC algorithm and a multipattern version of the BM algorithm, the 
most important algorithms for searching in uncompressed files. Collage 
system is a formal framework which is suitable to capture the essence 
of compressed pattern matching according to various dictionary based 
compressions. That is, we provide the model of multipattern matching 
algorithm for any compression method covered by the framework. 



1 Introduction 

The compressed pattern matching problem was first defined by Amir and Benson,
and various compressed pattern matching algorithms have been proposed
depending on underlying compression methods (see survey papers).

We previously introduced a collage system, which is a formal system to represent
a string by a pair of dictionary $\mathcal{D}$ and sequence $\mathcal{S}$ of phrases
in $\mathcal{D}$. The basic operations are concatenation, truncation, and repetition.
Collage systems give us a unifying framework of various dictionary-based
compression methods, such as Lempel-Ziv family (LZ77, LZSS, LZ78, LZW), RE-PAIR,
SEQUITUR, and the static dictionary based compression method. We also proposed a
simple pattern matching algorithm on collage system, which simulates the move
of the Knuth-Morris-Pratt automaton running on the original text, by using
the functions Jump and Output.

In this paper we address the multiple pattern matching problem on collage
system. That is, given a set $\Pi$ of patterns and a collage system
$(\mathcal{D}, \mathcal{S})$, we find all occurrences of any pattern in $\Pi$ within the
text represented by $(\mathcal{D}, \mathcal{S})$. It is rather easy to extend Jump to the
multipattern case. However, the extension of Output is not straightforward
because the single pattern version utilizes some

A. Amir and G.M. Landau (Eds.): CPM 2001, LNCS 2089, pp. 2001. 

© Springer-Verlag Berlin Heidelberg 2001
combinatorial properties on the period of the pattern. Although we have
developed a multipattern searching algorithm for LZW compressed texts, the
same technique cannot be adapted to general collage systems. Nevertheless, we
succeeded in developing an algorithm that runs in
$O((\|\mathcal{D}\| + |\mathcal{S}|) \cdot height(\mathcal{D}) + m^2 + r)$ time with
$O(\|\mathcal{D}\| + m^2)$ space, where $\|\mathcal{D}\|$ denotes the size of the
dictionary $\mathcal{D}$, $height(\mathcal{D})$ denotes the maximum dependency of the
operations in $\mathcal{D}$, $|\mathcal{S}|$ is the length of the sequence $\mathcal{S}$,
$m$ is the total length of patterns, and $r$ is the number of pattern
occurrences. Note that the time for decompressing-then-searching is linear with
respect to the original text length, which can grow in proportion to
$|\mathcal{S}| \cdot 2^{\|\mathcal{D}\|}$ in the worst case. Therefore, the algorithm is
more efficient than the decompress-then-search approach.

We also show an extension of our previously presented Boyer-Moore type
algorithm to multiple patterns. The algorithm runs on the sequence $\mathcal{S}$,
skipping some tokens. It runs in $O((height(\mathcal{D}) + m)|\mathcal{S}| + r)$ time
after an $O(\|\mathcal{D}\| \cdot height(\mathcal{D}) + m^2)$ time preprocessing with
$O(\|\mathcal{D}\| + m^2)$ space. Moreover, we mention the parallel complexity of
compressed pattern matching for a subclass of collage system in a later
section. Our result implies that the compressed pattern matching for regular
collage system can be efficiently parallelized in principle.



2 Related Works 

We previously presented a general pattern matching algorithm on collage system
for a single pattern. The algorithm runs in
$O((\|\mathcal{D}\| + |\mathcal{S}|) \cdot height(\mathcal{D}) + m^2 + r)$ time with
$O(\|\mathcal{D}\| + m^2)$ space. For the subclass of collage system which contains
no truncation, it runs in $O(\|\mathcal{D}\| + |\mathcal{S}| + m^2 + r)$ time using
$O(\|\mathcal{D}\| + m^2)$ space. We also presented a Boyer-Moore type algorithm.

Independently, Navarro and Raffinot developed a general technique for
string matching on a text given as a sequence of blocks, which abstracts both
LZ77 and LZ78 compressions, and gave bit-parallel implementations. The running
time of these algorithms based on bit-parallelism for LZW is
$O(nm/w + m + r)$, where $n$ is the text length and $w$ is the length in bits of
the machine word. If the pattern is short ($m < w$), these algorithms are
efficient in practice. A Boyer-Moore type algorithm for a single pattern on
Ziv-Lempel compressed text has also been developed.

3 Preliminaries 

Let $\Sigma$ be a finite set of characters, called an alphabet. A finite sequence
of characters is called a string. We denote the length of a string $u$ by $|u|$.
The empty string is denoted by $\varepsilon$, that is, $|\varepsilon| = 0$. Let
$\Sigma^*$ be the set of strings over $\Sigma$, and let
$\Sigma^+ = \Sigma^* \setminus \{\varepsilon\}$. Strings $x$, $y$, and $z$ are said to be
a prefix, factor, and suffix of the string $u = xyz$, respectively. A prefix,
factor, and suffix of a string $u$ is said to be proper if it is not $u$. Let
$Prefix(u)$ be the set of prefixes of a string $u$, and let
$Prefix(S) = \bigcup_{u \in S} Prefix(u)$ for a set $S$ of strings. We also define
the sets $Suffix$ and $Factor$ in a similar way. For strings $u, v \in \Sigma^*$, let

$lpf_v(u)$ = the longest prefix of $u$ that is also in $Factor(v)$,
$lsf_v(u)$ = the longest suffix of $u$ that is also in $Factor(v)$,
$lps_v(u)$ = the longest prefix of $u$ that is also in $Suffix(v)$,
$lsp_v(u)$ = the longest suffix of $u$ that is also in $Prefix(v)$.

For a set $\Pi$ of strings, let $lpf_\Pi(u)$ be the longest prefix of $u$ that is
also in $Factor(\Pi)$. We also define $lsf_\Pi(u)$, $lps_\Pi(u)$, and $lsp_\Pi(u)$ in a
similar way.

The $i$th symbol of a string $u$ is denoted by $u[i]$ for $1 \le i \le |u|$, and
the factor of a string $u$ that begins at position $i$ and ends at position $j$
is denoted by $u[i : j]$ for $1 \le i \le j \le |u|$. Denote by ${}^{[i]}u$ (resp.
$u^{[i]}$) the string obtained by removing the length $i$ prefix (resp. suffix)
from $u$ for $0 \le i \le |u|$. The concatenation of $i$ copies of the same string
$u$ is denoted by $u^i$. The reversed string of a string $u$ is denoted by $u^R$.

For a set $A$ of integers and an integer $k$, let
$A \oplus k = \{i + k \mid i \in A\}$ and $A \ominus k = \{i - k \mid i \in A\}$. For
strings $x$ and $y$, we denote the set of occurrences of $x$ in $y$ by
$Occ(x, y)$. That is,
$Occ(x, y) = \{i \mid |x| \le i \le |y|,\ x = y[i - |x| + 1 : i]\}$.
For a set $\Pi \subseteq \Sigma^+$ of strings,
$Occ(\Pi, y) = \{(i, x) \mid x \in \Pi,\ i \in Occ(x, y)\}$. Also denote by
$Occ^*(x, u \cdot v)$ the set of occurrences of $x$ within the concatenation of two
strings $u$ and $v$ which cover the boundary between $u$ and $v$. That is,
$Occ^*(x, u \cdot v) = \{i \mid i \in Occ(x, uv),\ |u| < i < |u| + |x|\}$.
For a set $\Pi \subseteq \Sigma^+$ of strings,
$Occ^*(\Pi, u \cdot v) = \{(i, x) \mid x \in \Pi,\ i \in Occ^*(x, u \cdot v)\}$.
We denote the cardinality of a set $V$ by $|V|$.
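
The definitions of $Occ$ and $Occ^*$ translate directly into code. The following Python sketch is only an illustration; the function names `occ`, `occ_star`, and `occ_set` are chosen here and do not come from the paper. Positions are 1-based end positions, as in the definitions.

```python
def occ(x, y):
    """Occ(x, y): 1-based end positions of the occurrences of x in y."""
    return {i for i in range(len(x), len(y) + 1) if y[i - len(x):i] == x}


def occ_star(x, u, v):
    """Occ*(x, u.v): occurrences in uv that cover the boundary of u and v."""
    return {i for i in occ(x, u + v) if len(u) < i < len(u) + len(x)}


def occ_set(patterns, y):
    """Occ(Pi, y): pairs (end position, pattern) for a set of patterns."""
    return {(i, x) for x in patterns for i in occ(x, y)}


print(occ("aba", "ababa"))
print(occ_star("aba", "ab", "aba"))
print(occ_set({"ab", "ba"}, "aba"))
```

In the second call only the occurrence ending at position 3 is kept: it starts inside `ab` and ends inside `aba`, whereas the occurrence ending at position 5 lies entirely in the second string.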

A period of a string $x$ is an integer $p$, $0 < p \le |x|$, such that
$x[i] = x[i + p]$ for all $i \in \{1, \ldots, |x| - p\}$. The next lemma provides an
important property on periods of a string.

Lemma 1 (Periodicity Lemma). Let $p$ and $q$ be two periods of a
string $x$. If $p + q - \gcd(p, q) \le |x|$, then $\gcd(p, q)$ is also a period of $x$.

The next lemma follows from the periodicity lemma. 

Lemma 2. Let $x$ and $y$ be strings. If $Occ(x, y)$ has more than two elements and
the difference of the maximum and the minimum elements is at most $|x|$, then
it forms an arithmetic progression, in which the step is the smallest period
of $x$.
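
A small numerical check of the notions of period and of Lemma 2 may help. The Python sketch below is an illustration with names chosen here (`periods`, `occ`); the example strings are hypothetical and satisfy the hypothesis of Lemma 2.

```python
def periods(x):
    """All periods p of x, 0 < p <= |x| (0-based indexing internally)."""
    n = len(x)
    return [p for p in range(1, n + 1)
            if all(x[i] == x[i + p] for i in range(n - p))]


def occ(x, y):
    """1-based end positions of the occurrences of x in y."""
    return {i for i in range(len(x), len(y) + 1) if y[i - len(x):i] == x}


x, y = "abab", "abababab"
ends = sorted(occ(x, y))
steps = {b - a for a, b in zip(ends, ends[1:])}
print(periods(x))        # the smallest period is the first entry
print(ends, steps)
```

Here $Occ(x, y)$ has three elements whose maximum and minimum differ by $4 = |x|$, and the common difference of the progression equals the smallest period of `abab`.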

4 Collage System and Text Compressions 

A collage system is a pair $(\mathcal{D}, \mathcal{S})$ defined as follows:
$\mathcal{D}$ is a sequence of assignments
$X_1 = expr_1;\ X_2 = expr_2;\ \ldots;\ X_\ell = expr_\ell$, where each $X_k$ is a
token and $expr_k$ is any of the form

  $a$ for $a \in \Sigma \cup \{\varepsilon\}$,  (primitive assignment)
  $X_i X_j$ for $i, j < k$,  (concatenation)
  ${}^{[j]}X_i$ for $i < k$ and an integer $j$,  (prefix truncation)
  $X_i^{[j]}$ for $i < k$ and an integer $j$,  (suffix truncation)
  $(X_i)^j$ for $i < k$ and an integer $j$.  ($j$ times repetition)

Each token represents a string obtained by evaluating the expression as it
implies. The strings represented by tokens are called phrases. The set of
phrases is called the dictionary. Denote by $X.u$ the phrase represented by a
token $X$. For example, for
$\mathcal{D}$: $X_1 = a$; $X_2 = b$; $X_3 = c$; $X_4 = X_1 \cdot X_2$;
$X_5 = X_3 \cdot X_2$; $X_6 = X_4 \cdot X_2$; $X_7 = (X_4)^3$; $X_8 = X_7^{[2]}$,
the phrases $X_4.u$, $X_5.u$, $X_6.u$, $X_7.u$, $X_8.u$ are ab, cb, abb, ababab,
and abab, respectively. The size of $\mathcal{D}$ is the number $\ell$ of
assignments and is denoted by $\|\mathcal{D}\|$. Also denote by $F(\mathcal{D})$ the
set of tokens which are defined in $\mathcal{D}$. That is,
$\|\mathcal{D}\| = |F(\mathcal{D})| = \ell$. Define the height of a token $X$ to be
the height of the syntax tree whose root is $X$. The height of $\mathcal{D}$ is
defined by $height(\mathcal{D}) = \max\{height(X) \mid X \text{ in } \mathcal{D}\}$. It
expresses the maximum dependency of the tokens in $\mathcal{D}$.
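
The example dictionary can be evaluated mechanically. The Python sketch below is an illustration only: the tuple encoding of assignments (`"prim"`, `"cat"`, `"ptrunc"`, `"strunc"`, `"rep"`) and the function name `evaluate` are assumptions introduced here, not notation from the paper.

```python
def evaluate(dictionary):
    """Map each token name to the phrase it represents.

    `dictionary` is a list of (name, expr) pairs in assignment order.
    """
    phrases = {}
    for name, expr in dictionary:
        op = expr[0]
        if op == "prim":                       # X = a
            phrases[name] = expr[1]
        elif op == "cat":                      # X = Xi Xj
            phrases[name] = phrases[expr[1]] + phrases[expr[2]]
        elif op == "ptrunc":                   # remove a length-j prefix
            phrases[name] = phrases[expr[1]][expr[2]:]
        elif op == "strunc":                   # remove a length-j suffix
            s = phrases[expr[1]]
            phrases[name] = s[:len(s) - expr[2]]
        elif op == "rep":                      # j times repetition
            phrases[name] = phrases[expr[1]] * expr[2]
        else:
            raise ValueError("unknown operation: " + op)
    return phrases


D = [
    ("X1", ("prim", "a")), ("X2", ("prim", "b")), ("X3", ("prim", "c")),
    ("X4", ("cat", "X1", "X2")), ("X5", ("cat", "X3", "X2")),
    ("X6", ("cat", "X4", "X2")), ("X7", ("rep", "X4", 3)),
    ("X8", ("strunc", "X7", 2)),
]
phrases = evaluate(D)
print([phrases[t] for t in ("X4", "X5", "X6", "X7", "X8")])
```

This sketch materializes every phrase, which is exactly what compressed pattern matching tries to avoid; it is meant only to make the semantics of the five operations concrete.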

On the other hand, $\mathcal{S} = X_{i_1}, \ldots, X_{i_k}$ is a sequence of tokens
defined in $\mathcal{D}$. We denote by $|\mathcal{S}|$ the number $k$ of tokens in
$\mathcal{S}$. The collage system represents a string obtained by concatenating
strings $X_{i_1}.u, \ldots, X_{i_k}.u$. Most text compression methods can be viewed
as mechanisms to factorize a text into a series of phrases and to store a
sequence of 'representations' of the phrases. In fact, various compression
methods can be translated into corresponding collage systems. Both
$\mathcal{D}$ and $\mathcal{S}$ can be encoded in various ways. The compression
ratios therefore depend on the encoding sizes of $\mathcal{D}$ and $\mathcal{S}$
rather than $\|\mathcal{D}\|$ and $|\mathcal{S}|$.



[Figure: nested classes of collage systems, from the most general to the most
restricted: collage system (LZ77, LZSS), truncation-free (run length),
regular (SEQUITUR, RE-PAIR, BPE), simple (LZW, LZ78).]

Fig. 1. Hierarchy of collage system.



A collage system is said to be regular if it contains neither repetition nor
truncation. A regular collage system is said to be simple if, for every
assignment $X = YZ$, $|Y.u| = 1$ or $|Z.u| = 1$. Through the collage systems, many
dictionary-based compression methods can be categorized into some classes (see
Fig. 1). Note that the collage systems for SEQUITUR and RE-PAIR are regular,
and those for the LZW/LZ78 compressions are simple.

5 Main Result 

Our main result is as follows. 

Theorem 1. The problem of compressed multiple pattern matching on a collage
system $(\mathcal{D}, \mathcal{S})$ can be solved in
$O((\|\mathcal{D}\| + |\mathcal{S}|) \cdot height(\mathcal{D}) + m^2 + r)$ time using
$O(\|\mathcal{D}\| + m^2)$ space, where $m$ is the total length of patterns in
$\Pi$, and $r$ is the number of pattern occurrences. If $\mathcal{D}$ contains no
truncation, it can be solved in $O(\|\mathcal{D}\| + |\mathcal{S}| + m^2 + r)$ time.

We previously developed an algorithm on collage system for a single pattern,
which basically simulates the Knuth-Morris-Pratt (KMP) algorithm. Although we
devised a multipattern searching algorithm for LZ78/LZW, the same technique
cannot be applied directly even to the case of regular collage systems. One
natural way of dealing with multiple patterns would be a simulation of the
Aho-Corasick (AC) pattern matching machine. Now, we start with the definitions
of $Jump_{AC}$ and $Output_{AC}$, which play a key role in our algorithm.

Let $\delta_{AC} : Q \times \Sigma \to Q$ be the deterministic state transition
function of the AC machine for $\Pi$ obtained by eliminating the failure
transitions. The set $Q$ of states has a one-to-one correspondence with
$Prefix(\Pi)$, and hence we identify $Q$ with $Prefix(\Pi)$ if no confusion
occurs. We shall use the terms "state" and "string" interchangeably throughout
the remainder of this paper. Fig. 3 is an example for
$\Pi = \{aba, ababb, abca, bb\}$. Let $Jump_{AC}$ be the state transition
function. For a collage system $(\mathcal{D}, \mathcal{S})$ and $\Pi$, define the
function $Jump_{AC} : Q \times F(\mathcal{D}) \to Q$ by

$Jump_{AC}(q, X) = \delta_{AC}(q, X.u)$.

We also define the set $Output_{AC}(q, X)$ for any pair $(q, X)$ in
$Q \times F(\mathcal{D})$ by

$Output_{AC}(q, X) = \{ (|u|, \pi) \mid u$ is a non-empty prefix of $X.u$ such that
$\pi \in \Pi$ is one of the outputs of state $s = \delta_{AC}(q, u) \}$.



That is, $Output_{AC}(q, X)$ stores all outputs emitted by the AC machine during
the state transitions from the state $q$ reading the string $X.u$. The proposed
algorithm can be summarized as in Fig. 2. For example, Fig. 4 shows the
move of our algorithm on $\mathcal{S}$ for $\Pi = \{aba, ababb, abca, bb\}$, where
$\mathcal{D}$ is the same as the example in Section 4 and
$\mathcal{S} = X_4, X_3, X_8, X_5, X_4, X_6$.



Concerning the function $Jump_{AC}(q, X)$, we can prove the next lemma in a
way similar to the single pattern case by regarding the string obtained by
concatenating all patterns in $\Pi$ as a single pattern. That is, for a set
$\Pi = \{\pi_1, \pi_2, \ldots, \pi_s\}$ of patterns, we make a string
$P = \pi_1 \# \pi_2 \# \cdots \# \pi_s$, where $\# \notin \Sigma$ is a separator
character.

Lemma 3. The function $Jump_{AC}(q, X)$ can be realized in
$O(\|\mathcal{D}\| \cdot height(\mathcal{D}) + m^2)$ time using
$O(\|\mathcal{D}\| + m^2)$ space, so that it answers in $O(1)$ time. If
$\mathcal{D}$ contains no truncation, the time complexity becomes
$O(\|\mathcal{D}\| + m^2)$, where $m$ is the total length of patterns in $\Pi$.

On the other hand, the realization of $Output_{AC}$ is not straightforward, and
we need some additional effort, which will be stated in the next section. Now,
we have:



198 Takuya Kida et al. 



Input. A set Π of patterns and a collage system (D, S), where S = S[1 : n].
Output. All positions at which a pattern π ∈ Π occurs in S[1].u ··· S[n].u.

/* Preprocessing */
Perform the preprocessing required for Jump_AC and Output_AC
(the complexity of this part depends on Π and D; see Section 6).
/* Text scanning */
ℓ := 0;
state := 0;
for k := 1 to n do begin
    for each (p, π) ∈ Output_AC(state, S[k]) do
        Report an occurrence of π that ends at position ℓ + p;
    state := Jump_AC(state, S[k]);
    ℓ := ℓ + |S[k].u|
end



Fig. 2. Pattern matching algorithm. 
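
The scanning loop of Fig. 2 can be tried out with a naive stand-in for the two functions. The Python sketch below is an illustration under stated assumptions and is not the data structure of this paper: `jump_and_output` simply expands the phrase and runs an ordinary Aho-Corasick machine over it, so it takes time proportional to the phrase length instead of the bounds of Lemmas 3 and 4. All function names are chosen here, and the phrases of the running example are given directly as strings.

```python
from collections import deque


def build_ac(patterns):
    """Aho-Corasick machine: returns (delta, out) with failure links resolved."""
    goto, fail, out = [{}], [0], [set()]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(p)
    queue = deque(goto[0].values())
    while queue:                               # breadth-first over the trie
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]
            queue.append(t)

    def delta(s, ch):
        while s and ch not in goto[s]:
            s = fail[s]
        return goto[s].get(ch, 0)

    return delta, out


def jump_and_output(delta, out, state, phrase):
    """Naive Jump_AC and Output_AC: read the expanded phrase symbol by symbol."""
    found = []
    for i, ch in enumerate(phrase, start=1):
        state = delta(state, ch)
        found += [(i, p) for p in sorted(out[state])]
    return state, found


def search(patterns, phrases, sequence):
    """Text scanning part of Fig. 2; reports (end position, pattern)."""
    delta, out = build_ac(patterns)
    state, offset, hits = 0, 0, []
    for token in sequence:
        state, found = jump_and_output(delta, out, state, phrases[token])
        hits += [(offset + p, pat) for p, pat in found]
        offset += len(phrases[token])
    return hits


PATTERNS = ["aba", "ababb", "abca", "bb"]
PHRASES = {"X3": "c", "X4": "ab", "X5": "cb", "X6": "abb", "X8": "abab"}
SEQUENCE = ["X4", "X3", "X8", "X5", "X4", "X6"]
print(search(PATTERNS, PHRASES, SEQUENCE))
```

Because the patterns are inserted in the order aba, ababb, abca, bb, the trie states happen to be numbered 1 to 9 in the same way as the strings a, ab, aba, abab, ababb, abc, abca, b, bb.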



Lemma 4. The procedure to enumerate the set $Output_{AC}(q, X)$ can be realized
in $O(\|\mathcal{D}\| \cdot height(\mathcal{D}) + m^2)$ time using
$O(\|\mathcal{D}\| + m^2)$ space, so that it runs in $O(height(X) + \ell)$ time,
where $\ell$ is the size of the set $Output_{AC}(q, X)$. If $\mathcal{D}$ contains
no truncation, it can be realized in $O(\|\mathcal{D}\| + m^2)$ time and space, so
that it runs in $O(\ell)$ time.

Theorem 1 follows from Lemma 3 and Lemma 4.

6 Realization of $Output_{AC}$

Recall the definition of the set $Output_{AC}(q, X)$. According to whether a
pattern occurrence covers the boundary between the strings $q \in Prefix(\Pi)$
and $X.u$, we can partition the set $Output_{AC}(q, X)$ into two disjoint subsets
as follows.

$Output_{AC}(q, X) = (Occ^*(\Pi, q \cdot X.u) \ominus |q|) \cup Occ(\Pi, X.u)$.

We consider mainly the subset $Occ(\Pi, X.u)$ below. It is easy to see that we
can enumerate the set $Occ^*(\Pi, q \cdot X.u)$ in $O(|Occ^*(\Pi, q \cdot X.u)|)$ time
if we can enumerate the set $Occ(\Pi, X.u)$ in $O(|Occ(\Pi, X.u)|)$ time, because
the former is essentially the same as the problem of the concatenation case of
the latter. Thus, we concentrate on proving the following lemma.

Lemma 5. For a collage system $(\mathcal{D}, \mathcal{S})$ and a set $\Pi$ of
patterns, we can enumerate the set $Occ(\Pi, X.u)$ for $X \in F(\mathcal{D})$ in
$O(|Occ(\Pi, X.u)|)$ time after $O(m^2)$ time and space preprocessing, assuming
that the sets $Occ(\Pi, Y.u)$, $lpf_\Pi(Y.u)$, and $lsf_\Pi(Y.u)$ are already computed
for all $Y$ such that $T(Y)$ is a subtree of $T(X)$ in the syntax tree.

Now, we begin to consider the case of regular collage systems.

Fig. 3. Aho-Corasick machine for Π = {aba, ababb, abca, bb}.
The solid and the broken arrows represent the goto and the failure functions,
respectively. The underlined strings adjacent to the states mean the outputs
from them.

token in S:        X4     X3    X8                    X5     X4     X6
original text:     a b    c     a b a b               c b    a b    a b b
function δ_AC:  0  1 2    6     7 2 3 4               6 8    1 2    3 4 5
Jump_AC(q, X):  0  2      6     4                     8      2      5
Output_AC(q, X):   ∅      ∅     {(1,abca), (3,aba)}   ∅      ∅      {(1,aba), (3,bb), (3,ababb)}

Fig. 4. Move of our algorithm.



6.1 For Regular Collage Systems 

It is obvious if $X$ is a primitive assignment. If $X$ is a concatenation, i.e.
$X = YZ$, we have
$Occ(\Pi, X.u) = Occ(\Pi, Y.u) \cup Occ^*(\Pi, Y.u \cdot Z.u) \cup (Occ(\Pi, Z.u) \oplus |Y.u|)$.
Assume that $Occ(\Pi, W.u)$, $lpf_\Pi(W.u)$, and $lsf_\Pi(W.u)$ are already computed
for all $W$ such that $T(W)$ is a subtree of $T(X)$ in the syntax tree. Then, we
need to enumerate the set $Occ^*(\Pi, Y.u \cdot Z.u)$ in order to enumerate the set
$Occ(\Pi, X.u)$. We can reduce the above problem to the following problem since
$Occ^*(\Pi, Y.u \cdot Z.u) = Occ^*(\Pi, lsf_\Pi(Y.u) \cdot lpf_\Pi(Z.u))$.

Instance: A set $\Pi$ of patterns and two factors $x$ and $y$ of $\Pi$.
Question: Enumerate the set $Occ^*(\Pi, x \cdot y)$.
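
Both the decomposition for $X = YZ$ and the reduction to factors of $\Pi$ can be checked by brute force. The Python sketch below is an illustration with names chosen here; it makes explicit one bookkeeping detail that the identity above leaves implicit, namely that positions measured in $lsf_\Pi(Y.u) \cdot lpf_\Pi(Z.u)$ must be shifted by $|Y.u| - |lsf_\Pi(Y.u)|$ to be positions in $Y.u \cdot Z.u$. That shift is our reading of the identity, not a statement from the paper.

```python
def occ(x, y):
    """1-based end positions of the occurrences of x in y."""
    return {i for i in range(len(x), len(y) + 1) if y[i - len(x):i] == x}


def occ_set(patterns, y):
    return {(i, x) for x in patterns for i in occ(x, y)}


def occ_star_set(patterns, u, v):
    """Occurrences in uv that cover the boundary between u and v."""
    return {(i, x) for x in patterns for i in occ(x, u + v)
            if len(u) < i < len(u) + len(x)}


def factors(patterns):
    return {p[i:j] for p in patterns
            for i in range(len(p) + 1) for j in range(i, len(p) + 1)}


def lsf(patterns, u):
    """Longest suffix of u that is a factor of some pattern."""
    fac = factors(patterns)
    return next(u[i:] for i in range(len(u) + 1) if u[i:] in fac)


def lpf(patterns, u):
    """Longest prefix of u that is a factor of some pattern."""
    fac = factors(patterns)
    return next(u[:j] for j in range(len(u), -1, -1) if u[:j] in fac)


PI = {"aba", "ababb", "abca", "bb"}
y, z = "cbcab", "abbc"                      # hypothetical phrases Y.u and Z.u

left = occ_set(PI, y)
middle = occ_star_set(PI, y, z)
right = {(i + len(y), x) for i, x in occ_set(PI, z)}
print(occ_set(PI, y + z) == left | middle | right)

s, t = lsf(PI, y), lpf(PI, z)
shifted = {(i + len(y) - len(s), x) for i, x in occ_star_set(PI, s, t)}
print(middle == shifted)
```

The second comparison shows why only $lsf_\Pi$ and $lpf_\Pi$ need to be stored for each token: a boundary-crossing occurrence can only use a suffix of $Y.u$ and a prefix of $Z.u$ that are themselves factors of $\Pi$.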

For the single pattern case, i.e. $\Pi = \{\pi\}$, it follows from Lemma 2 that
the set $Occ^*(\pi, x \cdot y)$ forms an arithmetic progression if it has more than
two elements, where the step is the smallest period of $\pi$. Thus
$Occ^*(\pi, x \cdot y)$ can be stored in $O(1)$ space as a pair of the minimum and
the maximum values in it. The table storing those values can be computed in
$O(m^2)$ time and space.

For the multipattern case, however, we cannot apply the above technique
directly to the enumeration of $Occ^*(\Pi, x \cdot y)$. Now, we prove the next lemma.

Lemma 6. For a set $\Pi$ of patterns, we can enumerate the set
$Occ^*(\Pi, x \cdot y)$ for all pairs of $x \in Prefix(\Pi)$ and $y \in Suffix(\Pi)$ in
$O(|Occ^*(\Pi, x \cdot y)|)$ time, after $O(m^2)$ time and space preprocessing.

Fig. 5. Short-cut pointers and the table $T$ for $\Pi = \{abcabc, cabb, abca\}$.
In the left figure, o indicates that $x'y'$ matches some pattern $\pi \in \Pi$
for $x' \in Suffix(x)$ and $y' \in Prefix(y)$, and x indicates that it does not
match.

Proof. For any $x \in Prefix(\Pi)$ and $y \in Suffix(\Pi)$, we can build in $O(m^2)$
time and space a table $T$ that stores $xy$ if $xy \in \Pi$, otherwise nil. Then, we
can enumerate the set $Occ^*(\Pi, x \cdot y)$ for all pairs of $x \in Prefix(\Pi)$ and
$y \in Suffix(\Pi)$ by using such a table $T$ in the following manner: for each
$x' \in Suffix(x) \cap Prefix(\Pi)$ and $y' \in Prefix(y) \cap Suffix(\Pi)$ in the
descending order of their length, report the occurrence of the pattern
$\pi = x'y'$ if $T(x', y') \ne$ nil. However, the time complexity for the
enumeration in this way becomes $O(m^2)$, not $O(|Occ^*(\Pi, x \cdot y)|)$. Then, we
add to each entry of the table $T$ a pair of two short-cut pointers $p_x$ and
$p_y$ in order to avoid increasing the time complexity, that is, for any pair of
$x \in Prefix(\Pi)$ and $y \in Suffix(\Pi)$, $p_x$ and $p_y$ point to the longest proper
suffix $x'$ of $x$ such that $Occ^*(\Pi, x' \cdot y) \ne \emptyset$ or $x' = \varepsilon$,
and the longest proper prefix $y'$ of $y$ such that $xy'$ is a pattern in $\Pi$ or
$y' = \varepsilon$, respectively. Fig. 5 shows an example of the pointers and the
table $T$, where $x = abca$ and $y = bcabc$ for $\Pi = \{abcabc, cabb, abca\}$.
Such pointers can be computed in $O(m^2)$ time by using the table $T$. Using
these pointers, we can get the desired sequence of pairs of $x' \in Suffix(x)$ and
$y' \in Prefix(y)$ in $O(|Occ^*(\Pi, x \cdot y)|)$ time for any pair of
$x \in Prefix(\Pi)$ and $y \in Suffix(\Pi)$. In the running example, the obtained
sequence is $(abca, bcabc) \to (abca, bc) \to (abca, \varepsilon) \to (a, bcabc) \to
(a, bca) \to (a, \varepsilon) \to (\varepsilon, \varepsilon)$. The proof is
complete. □
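
The first, slower enumeration described in the proof (without the short-cut pointers) is easy to write down. The Python sketch below is an illustration only: the table $T$ is replaced by a plain set lookup on the concatenation $x'y'$, the function name `boundary_pairs` is chosen here, and only non-empty $x'$ and $y'$ are tried, so that every reported pair is an occurrence that really covers the boundary.

```python
def boundary_pairs(patterns, x, y):
    """Pairs (x', y') with x' a non-empty suffix of x, y' a non-empty
    prefix of y, and x'y' a pattern; longer x' first, then longer y'."""
    table = set(patterns)          # stands in for the test T(x', y') != nil
    pairs = []
    for i in range(len(x)):                    # x' = x[i:]
        for j in range(len(y), 0, -1):         # y' = y[:j]
            if x[i:] + y[:j] in table:
                pairs.append((x[i:], y[:j]))
    return pairs


PI = ["abcabc", "cabb", "abca"]
x, y = "abca", "bcabc"
pairs = boundary_pairs(PI, x, y)
# End position in xy of the occurrence x'y' is |x| + |y'|.
ends = sorted((len(x) + len(yp), xp + yp) for xp, yp in pairs)
print(pairs)
print(ends)
```

The loop tries every pair of a suffix and a prefix, which is the quadratic behaviour that the pointers $p_x$ and $p_y$ are introduced to avoid. The pairs found here are exactly the matching entries among those visited by the pointer sequence of the running example; the entries with an empty component, such as $(abca, \varepsilon)$, do not cover the boundary and are therefore not reported by this sketch.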

We have thus finished the proof of Lemma 5 restricted to the class of regular
collage systems.

6.2 For Truncation-Free Collage Systems 

We need to solve the following problem for dealing with repetitions.

Instance: A set $\Pi$ of patterns, a factor $x$ of $\Pi$, and an integer $k > 1$.
Question: Enumerate the set $Occ(\Pi, x^k)$.

For the single pattern case, i.e. $\Pi = \{\pi\}$, we presented a solution
by using the Periodicity Lemma (Lemma 1). However, the same technique does not
work for the multipattern case. Now, we need to prove the next lemma.

Lemma 7. For a set $\Pi$ of patterns, $x \in Factor(\Pi)$, and an integer $k > 1$,
we can enumerate the set $Occ(\Pi, x^k)$ in $O(|Occ(\Pi, x^k)|)$ time after $O(m^2)$
time and space preprocessing, assuming that $Occ(\Pi, x)$, $lpf_\Pi(x)$, and
$lsf_\Pi(x)$ are already computed.

Proof. It is trivial for $k \le 2$. Suppose $k > 2$. Note that we can enumerate
the set $Occ^*(\Pi, x \cdot x)$ in $O(|Occ^*(\Pi, x \cdot x)|)$ time from Lemma 6 and
that $lps_\Pi(x) = lps_\Pi(lpf_\Pi(x))$. We use a generalized suffix trie for a set
$\Pi$ of strings ($GST_\Pi$ for short) in order to represent the set of suffixes of
the strings in $\Pi$. It is an extension of the suffix trie for a single string.
Note that each node of the $GST_\Pi$ corresponds to a string in $Factor(\Pi)$. The
construction of the $GST_\Pi$ takes $O(m^2)$ time and space.

Now, we have two cases to consider. 

Case 1: $xx \notin Factor(\Pi)$. Any pattern in $\Pi$ cannot cover more than three
$x$'s. We can answer in $O(1)$ time whether $xx$ is in $Factor(\Pi)$ or not since
$x \in Factor(\Pi)$ and the factor concatenation problem can be solved in $O(1)$
time. Moreover, we can obtain $Occ^*(\Pi, x \cdot xx)$ since we can obtain
$lpf_\Pi(xx)$ in $O(1)$ time. Then, we can compute three sets $Occ(\Pi, x)$,
$Occ^*(\Pi, x \cdot x)$, and $Occ^*(\Pi, x \cdot xx) \setminus Occ^*(\Pi, x \cdot x)$.
Therefore, the set $Occ(\Pi, x^k)$ can be enumerated in $O(|Occ(\Pi, x^k)|)$ time
using these sets, $|x|$ and $k$.

Case 2: $xx \in Factor(\Pi)$. For the pattern occurrences which are within three
$x$'s, we can enumerate them in the same way as Case 1. Now, we concentrate
on the enumeration of the pattern occurrences that are not within three $x$'s.
Suppose that a pattern $\pi$ has such an occurrence. Then $xx$ must be a factor
of $\pi$. Since $|x|$ is a period of $\pi$ and $2|x| \le |\pi|$, it follows from
Lemma 1 that $|x|$ is a multiple of the smallest period $t$ of $\pi$ and therefore
the set $Occ(\pi, x^k)$ forms an arithmetic progression whose step is $t$. Thus the
set can be enumerated in time linearly proportional to its size. However, some
occurrences in the enumeration can be included entirely within three $x$'s. In
order to avoid reporting them twice, we omit p in Occ{tt, x^) satisfying the 
inequation |7 t| — {p — ’ l^^l) > 2|a;| in the enumeration. So, we can 

enumerate all the pattern occurrences that are not within three $x$'s in time
linearly proportional to the number of them, if we have the list of the patterns
$\pi \in \Pi$ satisfying the conditions: (1) $xx \in Factor(\pi)$ and (2) $|x|$ is a
period of $\pi$. The condition (2) can be replaced by the condition (2'): the
smallest period of $xx$ equals that of $\pi$. We add a list of the patterns $\pi$
that satisfy the conditions (1) and (2') to each node of $GST_\Pi$ that
represents a string $xx = x^2$, called a square. It is not hard to check the
conditions in $O(m^2)$ time for all nodes of $GST_\Pi$. Each list added to a node
of $GST_\Pi$ requires $O(|\Pi|) = O(m)$ space. The number of nodes representing
squares is $O(m)$. Thus, the total space requirement is $O(m^2)$. Therefore, we
can enumerate the set $Occ(\Pi, x^k)$ in $O(|Occ(\Pi, x^k)|)$ time with $O(m^2)$ time
and space preprocessing.

The proof is complete. □ 
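
The arithmetic progression used in Case 2 can be observed on a small instance. The Python sketch below is an illustration with hypothetical data chosen here ($x = ab$, $k = 5$, $\pi = babab$), for which $xx$ is a factor of $\pi$ and $|x|$ is a period of $\pi$.

```python
def occ(x, y):
    """1-based end positions of the occurrences of x in y."""
    return {i for i in range(len(x), len(y) + 1) if y[i - len(x):i] == x}


def smallest_period(x):
    n = len(x)
    return next(p for p in range(1, n + 1)
                if all(x[i] == x[i + p] for i in range(n - p)))


x, k, pi = "ab", 5, "babab"
ends = sorted(occ(pi, x * k))
t = smallest_period(pi)
print(ends, t)
print(all(b - a == t for a, b in zip(ends, ends[1:])))
```

Since the occurrences form a progression with step $t$, it suffices to store the first and last end positions per pattern, which is what allows the enumeration to run in time proportional to the number of occurrences.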

If $Y.u \notin Factor(\Pi)$, since any pattern in $\Pi$ cannot cover more than two
$Y.u$'s, it is not hard to see that $Occ(\Pi, X)$ can be enumerated in
$O(|Occ(\Pi, X)|)$ time using $Occ(\Pi, Y.u)$, $Occ^*(\Pi, Y.u \cdot Y.u)$, and $k$. We
have thus finished the proof of Lemma 5 restricted to the class of
truncation-free collage systems.



6.3 For General Collage Systems 

For general collage systems, we must deal with truncation operations, in addition
to concatenations and repetitions. Using the same technique as in the single
pattern case, we can see that Lemma 5 holds if $X = {}^{[j]}Y$ or $X = Y^{[j]}$, and
that the time complexity increases by $height(\mathcal{D})$ as in the single
pattern case. That is, the next lemma holds.

Lemma 8. We can build in $O(\|\mathcal{D}\| \cdot height(\mathcal{D}) + m^2)$ time
using $O(\|\mathcal{D}\| + m^2)$ space a data structure by which the enumeration of
$Occ(\Pi, X.u)$ is performed in $O(height(X) + \ell)$ time, where
$\ell = |Occ(\Pi, X.u)|$. If $\mathcal{D}$ contains no truncation, it can be built in
$O(\|\mathcal{D}\| + m^2)$ time and space, and the enumeration requires only
$O(\ell)$ time.

Lemma 4 follows from the above. Although we need $lpf_\Pi(X.u)$ and $lsf_\Pi(X.u)$
for $X \in F(\mathcal{D})$, these can be computed in
$O(\|\mathcal{D}\| \cdot height(\mathcal{D}) + m^2)$ time using
$O(\|\mathcal{D}\| + m^2)$ space.

7 On BM Type Algorithm for Multiple Patterns 

We previously proposed a general BM type algorithm for a single pattern on
collage system. This algorithm is easily extensible to deal with multiple
patterns if we use the techniques stated in Section 6. We give a brief sketch
of the algorithm.

Recall that the BM algorithm on uncompressed texts performs the character
comparisons in the right-to-left direction, and slides the pattern to the right
using the so-called shift function when a mismatch occurs. Let $lpps_\Pi(w)$
denote the longest prefix of a string $w$ that is also properly in $Suffix(\Pi)$.
Note that the function is the state transition function of the (partial)
automaton that accepts a set $\Pi^R = \{\pi^R \mid \pi \in \Pi\}$ of reversed patterns.
Contrary to the case of the AC machine, the set $Q$ of states is $Suffix(\Pi)$.
Define the functions $Jump_{MBM}$ and $Output_{MBM}$ as follows. For any state
$q \in Suffix(\Pi)$ and any token $X \in F(\mathcal{D})$,

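$$\mathit{Jump}_{BM}(q, X) = \begin{cases}
\mathit{lpps}_\Pi(X.u), & \text{if } q = \varepsilon \text{ and } \mathit{lpps}_\Pi(X.u) \neq \varepsilon;\\
\ldots, & \text{if } q \neq \varepsilon;\\
\text{undefined}, & \text{otherwise.}
\end{cases}$$

$$\mathit{Output}_{BM}(q, X) = \{\pi \in \Pi \mid wq = \pi \text{ and } w \text{ is a proper suffix of } X.u\}.$$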

The shift function is basically designed to shift the pattern to the right so as
to align a text substring with its rightmost occurrence within the pattern. For
a pattern $\pi$ and a string $w$, let

$$\mathit{rightmost\_occ}_\pi(w) = \min \left\{ \ell > 0 \;\middle|\;
\begin{array}{l}
\pi[|\pi| - \ell - |w| + 1 : |\pi| - \ell] = w, \\
\text{or } \pi[1 : |\pi| - \ell] \text{ is a suffix of } w
\end{array} \right\}.$$
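
The definition of $\mathit{rightmost\_occ}$ can be checked on small inputs by a
direct search. The following Python sketch is only a brute-force illustration,
not the constant-time table lookup that the algorithm relies on. The function
names and the 0-based slicing are assumptions made for this example, and taking
the minimum over a pattern set is an assumed reading of the multi-pattern case.

```python
def rightmost_occ(pattern: str, w: str) -> int:
    """Smallest shift l > 0 such that w matches the pattern ending l
    characters before its right end, or such that the prefix of the
    pattern of length |pattern| - l is a suffix of w."""
    m = len(pattern)
    for l in range(1, m + 1):
        end = m - l                 # 1-based index |pattern| - l
        start = end - len(w)        # 0-based start of the candidate window
        # Case 1: pattern[|p|-l-|w|+1 : |p|-l] = w (w lies fully inside)
        if start >= 0 and pattern[start:end] == w:
            return l
        # Case 2: the prefix pattern[1 : |p|-l] is a suffix of w
        if w.endswith(pattern[:end]):
            return l
    return m  # not reached for non-empty patterns: l = m satisfies case 2


def rightmost_occ_set(patterns, w):
    """Assumed multi-pattern variant: the smallest shift over all patterns."""
    return min(rightmost_occ(p, w) for p in patterns)
```

For the pattern `abcab` and `w = "ab"`, the search stops at shift 3, which
aligns `w` with the prefix `ab` of the pattern.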
/* Preprocessing for computing $\mathit{Jump}_{BM}(j, t)$, $\mathit{Output}_{BM}(j, t)$, and $\mathit{Occ}(t)$ */
Preprocess the pattern $\pi$ and the dictionary $\mathcal{D}$;

/* Main routine */
focus := an appropriate value;
focus := $\lceil m/C \rceil$;
while focus $\le n$ do begin
    Step 1: Report all pattern occurrences that are contained in the phrase
            $S[\mathit{focus}].u$ by using $\mathit{Occ}(t)$;
    Step 2: Find all pattern occurrences that end within the phrase
            $S[\mathit{focus}].u$ by using $\mathit{Jump}_{BM}(j, t)$ and
            $\mathit{Output}_{BM}(j, t)$;
    Step 3: Compute a possible shift $\Delta$ based on information gathered in Step 2;
    focus := focus + $\Delta$
end

Fig. 6. Overview of BM type compressed pattern matching algorithm.



For a set $\Pi$ of patterns, let
$\mathit{rightmost\_occ}_\Pi(w) = \min_{\pi \in \Pi} \{\mathit{rightmost\_occ}_\pi(w)\}$,
and let $\mathit{Shift}_{BM}(q, X) = \mathit{rightmost\_occ}_\Pi(X.u \cdot q)$. When
we encounter a mismatch against a token $X$ in state $q \in \mathit{Suffix}(\Pi)$,
the possible shift $\Delta$ of the focus can be computed using
$\mathit{Shift}_{BM}$ in the same way as in the single pattern case. Figure 6 gives
an overview of our algorithm.

We can prove the next lemma using techniques similar to those stated above, and
Theorem 2 follows from Lemma 9.

Lemma 9. The functions $\mathit{Jump}_{BM}$, $\mathit{Output}_{BM}$, and
$\mathit{Shift}_{BM}$ can be built in
$O(\mathit{height}(\mathcal{D}) \cdot \|\mathcal{D}\| + m^2)$ time and
$O(\|\mathcal{D}\| + m^2)$ space, so that they answer in $O(1)$ time, where $m$ is
the total length of the patterns in $\Pi$. The factor $\mathit{height}(\mathcal{D})$
can be dropped if $\mathcal{D}$ contains no truncation.

Thus, we have the following theorem. 

Theorem 2. The BM type algorithm for multiple pattern searching on collage
systems runs in
$O(\mathit{height}(\mathcal{D}) \cdot (\|\mathcal{D}\| + |S|) + |S| \cdot m + m^2 + r)$
time, using $O(\|\mathcal{D}\| + m^2)$ space, where $m$ is the total length of the
patterns in $\Pi$, and $r$ is the number of pattern occurrences. If $\mathcal{D}$
contains no truncation, the time complexity becomes
$O(\|\mathcal{D}\| + |S| \cdot m + m^2 + r)$.

8 Parallel Complexity of Compressed Pattern Matching 

In this section, we consider the computational complexity of the following
decision problem for a class $C$ of collage systems:

Instance: A collage system $\langle \mathcal{D}, S \rangle$ in $C$ over $\Sigma$ and
a set $\Pi = \{\pi_1, \ldots, \pi_s\}$ of patterns.

Question: Is there any pattern $\pi_j \in \Pi$ that occurs in the text $T$
represented by $\langle \mathcal{D}, S \rangle$? That is, are there any $i$ and $j$
such that $T[i : i + |\pi_j| - 1] = \pi_j$?
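
For intuition, the decision problem can be stated operationally as the trivial
"decompress, then search" baseline. The sketch below assumes a hypothetical
encoding of a regular collage system: the dictionary maps each variable either
to a single character or to a pair of variables (a concatenation), and `S` is a
list of variables. It expands the text explicitly, so its running time may be
exponential in the dictionary size, and it says nothing about parallel
complexity.

```python
def expand(dictionary, var, memo=None):
    """Return the string var.u for a concatenation-only dictionary."""
    if memo is None:
        memo = {}
    if var not in memo:
        rule = dictionary[var]
        if isinstance(rule, str):        # primitive assignment X = a
            memo[var] = rule
        else:                            # concatenation X = Y Z
            left, right = rule
            memo[var] = (expand(dictionary, left, memo)
                         + expand(dictionary, right, memo))
    return memo[var]


def some_pattern_occurs(dictionary, sequence, patterns):
    """Decide whether any pattern occurs in the text represented by (D, S)."""
    memo = {}
    text = "".join(expand(dictionary, x, memo) for x in sequence)
    return any(p in text for p in patterns)


# Hypothetical instance: X4.u = "abab", so the text X4 X1 is "ababa".
D = {"X1": "a", "X2": "b", "X3": ("X1", "X2"), "X4": ("X3", "X3")}
S = ["X4", "X1"]
```

With this instance, `some_pattern_occurs(D, S, ["bab"])` answers yes, while the
pattern set `["bb", "aa"]` yields no.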
LogCFL is the class of problems logspace-reducible to a context-free language.
An auxiliary pushdown automaton (AuxPDA) is a nondeterministic Turing machine
with a read-only input tape, a space-bounded worktape, and a pushdown store that
is not subject to the space bound. The class of languages accepted by auxiliary
pushdown automata in space $s(n)$ and time $t(n)$ is denoted by
$\mathit{AuxPDA}(s(n), t(n))$. The next lemma is quite useful.

Lemma 10 (Sudborough). $\mathit{LogCFL} = \mathit{AuxPDA}(\log n, n^{O(1)})$.

We now show the following theorem. 

Theorem 3. The compressed pattern matching problem on regular collage systems
is in LogCFL.

Proof. We show an auxiliary pushdown automaton $M$ that accepts an input
string if and only if there is some pattern $\pi_j \in \Pi$ that occurs in the
text $T$ represented by $\langle \mathcal{D}, S \rangle$. We note that by using the
pushdown store, $M$ can traverse the evaluation tree of any variable $X_k$ and
'scan' the string $X_k.u$, which is the sequence of leaves in the tree, from left
to right. Moreover, by utilizing nondeterminism, $M$ can scan any substring of
$X_k.u$.

$M$ represents a position $t$ of a pattern as a binary string on the worktape,
and initializes it to $t = 1$. For simplicity, we first consider the case that a
pattern $\pi_j$ occurs within the string $X_k.u$ for some $X_k$. $M$
nondeterministically guesses such $j$ and $k$, and nondeterministically goes down
the evaluation tree of $X_k$ from the root, pushing the traversed variables onto
the pushdown store. At a leaf, $M$ confirms that the character $X_k.u[l]$ is
equal to $\pi_j[t]$. Then $M$ increments $t$ by one using the worktape, and
proceeds to the next character $X_k.u[l + 1]$ using the pushdown store. $M$
repeats this procedure until $\pi_j$ is verified to occur in $X_k.u$ at position
$l$. Note that $l$ is not explicitly written on the worktape: this is impossible
in general since $l = O(|X_k.u|)$, which can be exponential in
$\|\mathcal{D}\|$. On the other hand, since $t \le |\pi_j|$ and the patterns are
explicitly written on the input tape, the space required by $M$ is
$O(\log |\pi_j|)$, which is logarithmic with respect to the input size. The
computation time is clearly bounded by a polynomial, since the height of the
evaluation tree is at most $\|\mathcal{D}\|$. For the general case that a pattern
$\pi_j$ spreads over several consecutive variables of $S$, we can show in the
same way that $M$ verifies the occurrence in polynomial time using a log-space
worktape. By Lemma 10, the proof is complete. □

Since it is known that $\mathit{LogCFL} \subseteq NC^2$, the above theorem implies
that compressed pattern matching for regular collage systems can, in principle,
be efficiently parallelized. For general collage systems, which include
repetitions and truncations, we have not yet succeeded in showing that the
problem is either in NC or P-complete.

9 Concluding Remarks 

We proposed two types of multipattern matching algorithms on collage systems.
One is an AC-type algorithm, which runs in
$O((\|\mathcal{D}\| + |S|) \cdot \mathit{height}(\mathcal{D}) + m^2 + r)$ time with
$O(\|\mathcal{D}\| + m^2)$ space. Its running time becomes
$O(\|\mathcal{D}\| + |S| + m^2 + r)$ if the collage system contains no truncation.
The other is a BM-type algorithm, which runs in
$O((\mathit{height}(\mathcal{D}) + m)|S| + r)$ time after an
$O(\|\mathcal{D}\| \cdot \mathit{height}(\mathcal{D}) + m^2)$ time preprocessing with
$O(\|\mathcal{D}\| + m^2)$ space. We also showed that compressed pattern matching
on regular collage systems is in $\mathit{LogCFL} \subseteq NC^2$.

Compressed pattern matching usually aims to search compressed files faster than
ordinary decompression followed by an ordinary search (Goal 1). A more ambitious
goal is to search compressed files faster than an ordinary search in the
original files (Goal 2). In this case, the aim of compression is not only to
reduce disk storage requirements but also to speed up the string searching task.
In fact, we have achieved Goal 2 for the compression method called Byte Pair
Encoding (BPE).

Approximate string matching algorithms over LZW/LZ78 compressed texts have
been proposed. Very recently, Navarro et al. proposed a practical solution for
LZW/LZ78 compression and showed experimentally that it is up to three times
faster than the trivial approach of uncompressing and searching. The basic idea
of the solution is to reduce the problem of approximate string searching to the
problem of multipattern searching for a set of pattern pieces, plus local
decompression and direct verification of candidate text areas. Using the same
technique, the result of this paper leads to a speed-up of approximate string
matching for various compression methods. In fact, we have verified that the
suggested algorithm runs on BPE compressed texts faster than Agrep, known as the
fastest pattern matching tool.
Abstract. Given $k$ permutations of $n$ elements, a $k$-tuple of intervals
of these permutations consisting of the same set of elements is called
a common interval. We present an algorithm that finds in a family of $k$
permutations of $n$ elements all $K$ common intervals in optimal $O(nk + K)$
time and $O(n)$ additional space.

This extends a result by Uno and Yagiura (Algorithmica 26, 290-309,
2000) who present an algorithm to find all $K$ common intervals of $k = 2$
permutations in optimal $O(n + K)$ time and $O(n)$ space. To achieve our
result, we introduce the set of irreducible intervals, a generating subset
of the set of all common intervals of $k$ permutations.



1 Introduction 

Let $\Pi = (\pi_1, \ldots, \pi_k)$ be a family of $k$ permutations of
$N = \{1, 2, \ldots, n\}$. A $k$-tuple of intervals of these permutations consisting
of the same set of elements is called a common interval.

Common intervals have applications in different fields. The consecutive
arrangement problem is defined as follows: Given a finite set $X$ and a
collection $\mathcal{S}$ of subsets of $X$, find all permutations of $X$ where the
members of each subset $S \in \mathcal{S}$ occur consecutively. Finding all common
intervals of a set of permutations reverses this problem. Some genetic
algorithms using subtour exchange crossover based on common intervals have been
proposed for sequencing problems such as the traveling salesman problem or the
single machine scheduling problem. In a bioinformatical context, common
intervals can be used to detect possible functional associations between genes.
It is supposed that genes occurring in different genomes in each other's
neighborhood tend to encode functionally interacting proteins. If one models
genomes as permutations of genes, the problem of finding co-occurring genes
translates into the problem of finding common intervals.
Recently, Uno and Yagiura presented three algorithms for finding all
common intervals of $k = 2$ permutations $\pi_1$ and $\pi_2$: two simple $O(n^2)$
time algorithms and one more complicated $O(n + K)$ time algorithm, where
$K \le \binom{n}{2}$ is the number of common intervals of $\pi_1$ and $\pi_2$. Since
the latter algorithm runs in time proportional to the size of the input plus the
size of the output, it is optimal in the sense of worst case complexity.

An obvious extension of this algorithm to find all common intervals of a
family $\Pi = (\pi_1, \ldots, \pi_k)$ of $k \ge 2$ permutations would be to compare
$\pi_1$ successively with $\pi_i$ for $i = 2, \ldots, k$ and report those intervals
that are common in all comparisons. This yields an
$O(kn + \sum_{i=2}^{k} K_i)$ time algorithm, where $K_i$ is the number of common
intervals of $\pi_1$ and $\pi_i$ for $2 \le i \le k$. The main result of this paper
is an improvement of this approach by a non-trivial extension of Uno and
Yagiura's algorithm, yielding an optimal $O(kn + K)$ time and $O(n)$ space
algorithm, where $K$ is the number of common intervals of $\Pi$. Note that this
number can be considerably smaller than any of the $K_i$.

The approach relies on restricting the set of all common intervals $C$ to a
smaller subset of irreducible intervals $I$, from which $C$ can be easily
reconstructed. While the number of common intervals can be as large as
$\binom{n}{2}$, we show that $1 \le |I| \le n - 1$ and present an algorithm to
compute $I$ in optimal $O(kn)$ time, i.e., in time proportional to the input
size. Knowing $I$ we can reconstruct $C$ in $O(K)$ time, i.e., in time
proportional to the output size. Both algorithms use $O(n)$ additional space and
their combination yields our main result.

2 Permutations and Common Intervals 

Given a permutation $\pi$ of (the elements of) the set $N := \{1, 2, \ldots, n\}$,
we denote by $\pi(i) = j$ that the $i$th element of $\pi$ is $j$. For
$x, y \in N$, $x < y$, $[x, y]$ denotes the set $\{x, x + 1, \ldots, y\} \subseteq N$
and $\pi([x, y]) := \{\pi(i) \mid i \in [x, y]\}$ is called an interval of $\pi$.
Let $\Pi = (\pi_1, \ldots, \pi_k)$ be a family of $k$ permutations of $N$. W.l.o.g.
we always assume in the following that $\pi_1 = \mathrm{id}_n := (1, \ldots, n)$. A
$k$-tuple $c = ([l_1, u_1], \ldots, [l_k, u_k])$ with $1 \le l_j < u_j \le n$ for
all $1 \le j \le k$ is called a common interval of $\Pi$ if and only if

$$\pi_1([l_1, u_1]) = \pi_2([l_2, u_2]) = \cdots = \pi_k([l_k, u_k]).$$

This allows us to identify a common interval $c$ with the contained elements,
i.e. $c \equiv \pi_j([l_j, u_j])$ for $1 \le j \le k$.

Since $\pi_1 = \mathrm{id}_n$, the above set equals the index set $[l_1, u_1]$, and
we will refer to this as the standard notation of $c$. The set of all common
intervals of $\Pi$ is denoted $C_\Pi$. Note that our definition excludes common
intervals of size one.

Example 1. Let $N = \{1, \ldots, 9\}$ and $\Pi = (\pi_1, \pi_2, \pi_3)$ with
$\pi_1 = \mathrm{id}_9$, $\pi_2 = (9, 8, 4, 5, 6, 7, 1, 2, 3)$, and
$\pi_3 = (1, 2, 3, 8, 7, 4, 5, 6, 9)$. We have

$$C_\Pi = \{[1,2], [1,3], [1,8], [1,9], [2,3], [4,5], [4,6], [4,7], [4,8], [4,9], [5,6]\}.$$
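
The set in Example 1 can be reproduced directly from the definition. The
following Python sketch is a brute-force check, not the algorithm of this paper:
it tests every value range $[l, u]$ against every permutation and therefore takes
roughly $O(kn^3)$ time. The function name and the tuple representation of
intervals are choices made for this illustration.

```python
def common_intervals(perms):
    """All common intervals (l, u), l < u, in standard notation, for a
    family of permutations of 1..n whose first member is the identity."""
    n = len(perms[0])
    # pos[i][v] = 1-based position of value v in the i-th permutation
    pos = [{v: idx for idx, v in enumerate(p, start=1)} for p in perms]
    result = []
    for l in range(1, n):
        for u in range(l + 1, n + 1):
            ok = True
            for p in pos:
                where = [p[v] for v in range(l, u + 1)]
                # the values l..u are consecutive in this permutation
                # exactly when their positions span u - l
                if max(where) - min(where) != u - l:
                    ok = False
                    break
            if ok:
                result.append((l, u))
    return result


example = [
    [1, 2, 3, 4, 5, 6, 7, 8, 9],
    [9, 8, 4, 5, 6, 7, 1, 2, 3],
    [1, 2, 3, 8, 7, 4, 5, 6, 9],
]
```

Calling `common_intervals(example)` returns the eleven intervals listed above.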
3 Finding All Common Intervals of Two Permutations

In order to keep this paper self-contained, we briefly recall here the algorithm
RC (short for Reduce Candidate) of Uno and Yagiura, which finds all $K$ common
intervals of $k = 2$ permutations $\pi_1 = \mathrm{id}_n$ and $\pi_2$ of $N$ in
$O(n + K)$ time and $O(n)$ space. For the correctness and analysis of the
algorithm we refer to the paper of Uno and Yagiura.

An easy test of whether an interval $\pi_2([x, y])$, $1 \le x < y \le n$, is a
common interval of $\Pi = (\pi_1, \pi_2)$ is based on the following functions:

$$l(x, y) := \min \pi_2([x, y])$$
$$u(x, y) := \max \pi_2([x, y])$$
$$f(x, y) := u(x, y) - l(x, y) - (y - x).$$

Since $f(x, y)$ counts the number of elements in
$[l(x, y), u(x, y)] \setminus \pi_2([x, y])$, an interval $\pi_2([x, y])$ is a
common interval of $\Pi$ if and only if $f(x, y) = 0$. A simple algorithm to find
$C_\Pi$ is to test for each pair of indices $(x, y)$ with $1 \le x < y \le n$ if
$f(x, y) = 0$, yielding a naive $O(n^3)$ time or, using running minima and
maxima, a slightly more involved $O(n^2)$ time algorithm.
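
The quadratic variant with running minima and maxima can be sketched as
follows. This is an illustrative Python version under the convention
$\pi_1 = \mathrm{id}_n$; the function name is an assumption, and it is the simple
test described above, not Algorithm RC.

```python
def common_intervals_two(pi2):
    """Common intervals of (id_n, pi2) in standard notation, O(n^2) time."""
    n = len(pi2)
    found = []
    for x in range(1, n):                 # left index, 1-based
        lo = hi = pi2[x - 1]
        for y in range(x + 1, n + 1):     # right index, 1-based
            v = pi2[y - 1]
            lo = min(lo, v)               # running l(x, y)
            hi = max(hi, v)               # running u(x, y)
            f = hi - lo - (y - x)         # elements of [lo, hi] still missing
            if f == 0:
                found.append((lo, hi))
    return found
```

For `pi2 = [2, 4, 1, 3]` every proper window misses at least one value, so the
only common interval reported is `(1, 4)`.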

The main idea of Algorithm RC is to save the time to test $f(x, y) = 0$ for
some pairs $(x, y)$ by eliminating wasteful candidates for $y$.

Definition 1. For a fixed $x$, a right interval end $y > x$ is called wasteful
if it satisfies $f(x', y) > 0$ for all $x' \le x$.

In Algorithm RC (Algorithm 1) the common intervals are found using a data
structure $Y$ consisting of a doubly-linked list ylist for indices of
non-wasteful right interval end candidates and, storing intervals of ylist, two
further doubly-linked lists llist and ulist that implement the functions $l$ and
$u$ in order to compute $f$ efficiently. They are also essential for an efficient
update of ylist. In our pseudocode we use the standard list operations $L.head$
for the first element of list $L$, $L.succ(e)$ for the successor and $L.pred(e)$
for the predecessor of element $e$ in $L$.



Algorithm 1 (Reduce Candidate, RC)

Input: A family $\Pi = (\pi_1 = \mathrm{id}_n, \pi_2)$ of two permutations of $N = \{1, \ldots, n\}$.
Output: $C_\Pi$ in standard notation.

1: initialize $Y$
2: for $x = n - 1, \ldots, 1$ do
3:     update $Y$ // (see Algorithm 2)
4:     $y \leftarrow x$
5:     while ($y \leftarrow ylist.succ(y)$) defined and $f(x, y) = 0$ do
6:         output $[l(x, y), u(x, y)]$
7:     end while
8: end for
In the first step of Algorithm 1, ylist is initialized containing one element
that stores the index $n$, and llist and ulist are both initialized with the
one-element interval $[n, n]$ consisting of the last/only element of ylist.

Then $Y$ is updated iteratively. A counter $x$ (corresponding to the currently
investigated left interval end) runs from $n - 1$ down to 1. For any fixed $x$,
the elements of llist are maximal intervals of ylist such that for an interval
$[y, y']$ we have $l(x, y) = l(x, ylist.succ(y)) = \cdots = l(x, y')$; similarly
for ulist. For an interval $[y, y']$ in llist or ulist, we define its value by
$val([y, y']) := \pi_2(y)$ and its end by $end([y, y']) := y'$. Algorithm 2 shows
the update procedure for $\pi_2(x) > \pi_2(x + 1)$. The case
$\pi_2(x) < \pi_2(x + 1)$ is treated in a symmetric way.



Algorithm 2 (Update of data structure $Y$ in line 3 of Algorithm 1)

1: prepend $x$ at the head of ylist
2: prepend $[x, x]$ at the head of llist
3: while ($u^* \leftarrow ulist.head$) has a successor $u$ and $val(u) < \pi_2(x)$ do
4:     delete $u^*$ from ulist and the corresponding elements from ylist
5: end while
6: $y^* \leftarrow end(u^*)$
7: if ($y \leftarrow ylist.succ(y^*)$) is defined then
8:     while $f(x, y^*) > f(x, y)$ do
9:         delete $y^*$ from ylist
10:        $y^* \leftarrow ylist.pred(y)$
11:    end while
12: end if
13: update the left and right end of $u^* \leftarrow [x, y^*]$



First, index $x$ is prepended at the head of ylist and $[x, x]$ is prepended at
the head of llist. Then ylist is trimmed by deleting all elements $y$ ($> x$)
that can be concluded to be wasteful (lines 3-12). This is called Trimming_YLIST
in the paper of Uno and Yagiura. Simultaneously, ulist is trimmed in line 3.
Finally, the interval ends of the new head of ulist, $u^*$, are updated.

Coming back to Algorithm 1, Uno and Yagiura show that in iteration step $x$,
after the update of $Y$, the function $f(x, y)$ is monotonically increasing for
the elements $y$ remaining in ylist. This allows lines 5-7 to find all common
intervals with left end $x$ efficiently by evaluating $f(x, y)$ while running
left-to-right through ylist until an index $y$ is encountered with $f(x, y) > 0$.



4 Irreducible Intervals 



In this section we define the set of irreducible intervals and show how they can 
be used to reconstruct all common intervals. We start by characterizing the 
structure of the set of common intervals. 
Lemma 1. Let $\Pi$ be a family of permutations. For $c_1, c_2 \in C_\Pi$ we have

$$|c_1 \cap c_2| \ge 2 \;\Leftrightarrow\; c_1 \cap c_2 \in C_\Pi,$$
$$c_1 \cap c_2 \ne \emptyset \;\Rightarrow\; c_1 \cup c_2 \in C_\Pi.$$

Proof. This follows immediately from the definition of common intervals. □

Two common intervals $c_1, c_2 \in C_\Pi$ have a non-trivial overlap if
$c_1 \cap c_2 \ne \emptyset$ and they do not include each other. A list
$p = (c_1, \ldots, c_{\ell(p)})$ of common intervals
$c_1, \ldots, c_{\ell(p)} \in C_\Pi$ is a chain (of length $\ell(p)$) if every two
successive intervals in $p$ have a non-trivial overlap. A chain of length one is
called a trivial chain, all other chains are called non-trivial chains. A chain
that cannot be extended to its left or right is a maximal chain. By Lemma 1,
every chain $p$ generates a common interval
$c = \tau(p) := \bigcup_{c' \in p} c'$.

Definition 2. A common interval $c$ is called reducible if there is a non-trivial
chain that generates $c$, otherwise it is called irreducible.

This definition partitions the set of common intervals $C_\Pi$ into the set of
reducible intervals and the set of irreducible intervals, denoted $I_\Pi$.
Obviously, $1 \le |I_\Pi| \le |C_\Pi| \le \binom{n}{2}$. For a common interval
$c \in C_\Pi$ we count the number of irreducible intervals that properly contain
$c$ and call this number the nesting level of $c$.
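
Definition 2 can be explored with a small brute-force sketch in Python. It
relies on an observation that is an assumption of this illustration rather than
a statement from the text: if a non-trivial chain generates $c$, then by Lemma 1
some prefix of the chain and its next element already form two overlapping
common intervals, both smaller than $c$, whose union is $c$. All function names
are hypothetical.

```python
def common_intervals(perms):
    """Brute-force common intervals (first permutation is the identity)."""
    n = len(perms[0])
    pos = [{v: i for i, v in enumerate(p)} for p in perms]
    out = []
    for l in range(1, n):
        for u in range(l + 1, n + 1):
            if all(max(p[v] for v in range(l, u + 1))
                   - min(p[v] for v in range(l, u + 1)) == u - l
                   for p in pos):
                out.append((l, u))
    return out


def irreducible_intervals(perms):
    """Common intervals that are not the union of two smaller,
    overlapping common intervals."""
    C = set(common_intervals(perms))
    irreducible = []
    for (l, u) in sorted(C):
        reducible = any(
            (l, r) in C and (s, u) in C
            for r in range(l + 1, u)       # left part [l, r] with r < u
            for s in range(l + 1, r + 1)   # right part [s, u] with l < s <= r
        )
        if not reducible:
            irreducible.append((l, u))
    return irreducible


example = [
    [1, 2, 3, 4, 5, 6, 7, 8, 9],
    [9, 8, 4, 5, 6, 7, 1, 2, 3],
    [1, 2, 3, 8, 7, 4, 5, 6, 9],
]
```

On the family of Example 1 this yields eight irreducible intervals, which is
within the range from 1 to $n - 1 = 8$.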

Lemma 2. Let $\Pi$ be a family of permutations, $c \in C_\Pi$ a common interval,
and $(b_1, \ldots, b_\ell)$ a chain of irreducible intervals generating $c$. The
nesting levels of $c$ and all the $b_i$ for $i = 1, \ldots, \ell$ are equal.

Proof (Sketch). Let $n_c$ be the nesting level of $c$ and $n_i$ the nesting level
of $b_i$ for $i = 1, \ldots, \ell$. Since $b_i \subseteq c$ we have $n_i \ge n_c$
for $i = 1, \ldots, \ell$. If $n_i > n_c$, there exists an irreducible interval
$c^* \not\supseteq c$ with $b_i \subseteq c^*$ and $\ell > 1$. Now we distinguish
between internal and terminal intervals $b_i$ in the chain. In both cases one can
easily see that $c^*$ can be generated by smaller common intervals, contradicting
the assumption that $c^*$ is irreducible. □

We can further partition $I_\Pi$ into maximal chains. This partitioning is
unique. For a maximal chain $p = (c_1, \ldots, c_{\ell(p)})$ and
$1 \le i \le j \le \ell(p)$, we call $p[i, j] := (c_i, \ldots, c_j)$ a subchain of
$p$.

Lemma 3. The set of common intervals that is generated from the subchains of
the maximal chains of $I_\Pi$ equals $C_\Pi$.

Proof. This follows directly from the definition of the partition. □



Example 1 (cont'd). For $\Pi = (\pi_1, \pi_2, \pi_3)$ as above, the irreducible
intervals are

$$I_\Pi = \{[1,2], [1,8], [2,3], [4,5], [4,7], [4,8], [4,9], [5,6]\}.$$

The reducible intervals are generated as follows:

$$[1,3] = [1,2] \cup [2,3],$$
$$[1,9] = [1,8] \cup [4,9],$$
$$[4,6] = [4,5] \cup [5,6].$$

A sketch of the structure of maximal chains of irreducible intervals and their
nesting levels is shown in Figure 1.



[Figure 1: the irreducible intervals arranged by nesting level. Level 0: [1,8],
[4,9]; level 1: [1,2], [2,3]; level 2: [4,8]; level 3: [4,7]; level 4: [4,5],
[5,6].]

Fig. 1. Visualization of the irreducible intervals in $I_\Pi$ and their nesting levels.



Lemma 4. Given two different maximal chains $p_1$ and $p_2$, exactly one of the
following alternatives is true:

- $\tau(p_1)$ and $\tau(p_2)$ are disjoint,
- $\tau(p_1)$ is contained in a single element of $p_2$, or
- $\tau(p_2)$ is contained in a single element of $p_1$.

Proof. $\tau(p_1)$ and $\tau(p_2)$ are either disjoint or have a non-empty
intersection. In the latter case, $\tau(p_1)$ and $\tau(p_2)$ cannot overlap
non-trivially, because of the maximality of $p_1$ and $p_2$. Therefore, w.l.o.g.
suppose $\tau(p_2) \subseteq \tau(p_1)$. No element of $p_2$ can overlap
non-trivially with any element of $p_1$, otherwise one could find an element of
$p_1$ or $p_2$ that is generated by smaller intervals, contradicting its
irreducibility. This yields the existence of exactly one irreducible interval $c$
of $p_1$ that includes $\tau(p_2)$ completely, while no other element of $p_1$
overlaps with $\tau(p_2)$. □

Based on the above lemmas, we describe a linear time algorithm to reconstruct
the set $C_\Pi$ of common intervals of a family of permutations $\Pi$ from its
set $I_\Pi$ of irreducible intervals (Algorithm 3). The algorithm partitions
$I_\Pi$ into maximal chains (line 1). This can be done, for example, by the
following three steps. First, $I_\Pi$ is partitioned according to the nesting
level. This is possible in $O(|I_\Pi|)$ time by applying a sweep line technique
to all interval start and end points. Then the intervals in the resulting
classes are sorted by their left end. Using radix sort, this can also be done in
$O(|I_\Pi|)$ time. Finally, the classes are further refined at non-overlapping
consecutive intervals, yielding the maximal chains of irreducible intervals.
This again takes $O(|I_\Pi|)$ time. Using Lemma 3 we create $C_\Pi$ by generating
all subchains of the maximal chains (lines 2-4). This takes $O(|C_\Pi|)$ time.
Since $|I_\Pi| \le |C_\Pi|$, Algorithm 3 takes $O(|C_\Pi|)$ time in total.



Algorithm 3 (Reconstruct $C_\Pi$ from $I_\Pi$)

Input: $I_\Pi$ in standard notation
Output: $C_\Pi$ in standard notation

1: partition $I_\Pi$ into maximal chains $p_1, p_2, \ldots$
2: for each $p_m = (b_1, \ldots, b_{\ell(p_m)})$ do
3:     output $\tau(p_m[i, j])$ in standard notation for all $1 \le i \le j \le \ell(p_m)$
4: end for
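
The reconstruction step can be illustrated with the following Python sketch. It
deliberately simplifies line 1: instead of the linear-time sweep and radix sort
described above, it groups intervals by connectivity under non-trivial overlap
in quadratic time, assuming (in the spirit of Lemma 4) that these groups
coincide with the maximal chains. The function names are hypothetical.

```python
def nontrivial_overlap(a, b):
    """True if intervals a and b intersect and neither contains the other."""
    (al, au), (bl, bu) = a, b
    intersect = max(al, bl) <= min(au, bu)
    a_in_b = bl <= al and au <= bu
    b_in_a = al <= bl and bu <= au
    return intersect and not a_in_b and not b_in_a


def reconstruct(irreducible):
    """Generate all common intervals from the irreducible ones."""
    todo = sorted(irreducible)
    chains = []
    while todo:
        # grow one chain as a connected group under non-trivial overlap
        chain, frontier = [], [todo.pop(0)]
        while frontier:
            c = frontier.pop()
            chain.append(c)
            linked = [d for d in todo if nontrivial_overlap(c, d)]
            todo = [d for d in todo if d not in linked]
            frontier.extend(linked)
        chains.append(sorted(chain))       # left-to-right order
    result = set()
    for chain in chains:
        for i in range(len(chain)):
            lo, hi = chain[i]
            for j in range(i, len(chain)):
                lo = min(lo, chain[j][0])
                hi = max(hi, chain[j][1])
                result.add((lo, hi))       # union of the subchain chain[i..j]
    return sorted(result)


I_example = [(1, 2), (1, 8), (2, 3), (4, 5), (4, 7), (4, 8), (4, 9), (5, 6)]
```

Applied to the irreducible intervals of Example 1, `reconstruct(I_example)`
returns the eleven common intervals, including the three reducible ones.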



The following theorem is the basis for the complexity analysis of our algorithm
in the following section.

Theorem 1. Given a family $\Pi = (\pi_1, \ldots, \pi_k)$ of permutations of
$N = \{1, 2, \ldots, n\}$, we have $1 \le |I_\Pi| \le n - 1$.

Proof. For each interval $[j, j + 1]$, $j = 1, \ldots, n - 1$, of $\pi_1$ denote
by $b_{[j,j+1]} \in I_\Pi$ the irreducible interval of smallest cardinality
containing $[j, j + 1]$. It is easy to see that $b_{[j,j+1]}$ is uniquely
defined. For any $c = [x, y] \in C_\Pi$, a subset of
$\{b_{[x,x+1]}, \ldots, b_{[y-1,y]}\}$ generates $c$. This yields
$\{b_{[j,j+1]} \mid j = 1, \ldots, n - 1\} = I_\Pi$. □



Example 2. The limits given in Theorem 1 are actually achieved. For
$\Pi = (\mathrm{id}_{2k}, (1, k + 1, 2, k + 2, \ldots, k, 2k))$ we have
$C_\Pi = I_\Pi = \{[1, n]\}$. For $\Pi = (\mathrm{id}_n, \mathrm{id}_n)$ we have
$C_\Pi = \{[i, j] \mid 1 \le i < j \le n\}$ and
$I_\Pi = \{[i, i + 1] \mid 1 \le i < n\}$.

5 Finding All Irreducible Intervals of k Permutations 

In this section we present our algorithm that finds all irreducible intervals of
a family $\Pi = (\pi_1, \pi_2, \ldots, \pi_k)$ of $k \ge 2$ permutations of
$N = \{1, \ldots, n\}$ in $O(kn)$ time. Together with Algorithm 3 this allows us
to find all $K$ common intervals of $\Pi$ in optimal $O(kn + K)$ time.

5.1 Outline of the Algorithm 

For $1 \le i \le k$, set $\Pi_i := (\pi_1, \ldots, \pi_i)$. Starting with
$I_{\Pi_1} = \{[j, j + 1] \mid 1 \le j < n\}$, the algorithm successively computes
$I_{\Pi_i}$ from $I_{\Pi_{i-1}}$ for $i = 2, \ldots, k$ (see Algorithm 4). To
construct $I_{\Pi_i}$ from $I_{\Pi_{i-1}}$, we define the mapping

$$\varphi_i : I_{\Pi_{i-1}} \to I_{\Pi_i}$$

where for $c \in I_{\Pi_{i-1}}$, $\varphi_i(c)$ is the smallest common interval
$c' \in C_{\Pi_i}$ that contains $c$. Since
$I_{\Pi_i} \subseteq C_{\Pi_i} \subseteq C_{\Pi_{i-1}}$ and, by Lemma 3,
$I_{\Pi_{i-1}}$ generates the elements of $C_{\Pi_{i-1}}$, $I_{\Pi_{i-1}}$ also
generates $I_{\Pi_i}$. One can easily see that $c' \in I_{\Pi_i}$ and that
$\varphi_i$ is surjective, i.e.
$I_{\Pi_i} = \{\varphi_i(c) \mid c \in I_{\Pi_{i-1}}\}$. This implies the
correctness of Algorithm 4. In Section 5.2 we will show how
$\varphi_i(I_{\Pi_{i-1}})$ can be computed in $O(n)$ time and space, yielding the
$O(kn)$ time complexity to compute $I_\Pi$ ($= I_{\Pi_k}$).



Algorithm 4 (Computation of $I_{\Pi_k}$)

Input: A family $\Pi = (\pi_1 = \mathrm{id}_n, \pi_2, \ldots, \pi_k)$ of $k$ permutations of $N = \{1, \ldots, n\}$.
Output: $I_\Pi$ in standard notation.

1: $I_{\Pi_1} \leftarrow ([1,2], [2,3], \ldots, [n-1,n])$
2: for $i = 2, \ldots, k$ do
3:     $I_{\Pi_i} \leftarrow \{\varphi_i(c) \mid c \in I_{\Pi_{i-1}}\}$ // (see Algorithm 5)
4: end for
5: output $I_{\Pi_k}$ in standard notation
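
The iteration of Algorithm 4 can be traced with a brute-force stand-in for
$\varphi_i$. The Python sketch below recomputes all common intervals of
$\Pi_i$ in every round and picks the smallest one containing $c$, so it
illustrates only the recurrence, not the $O(n)$ per-round procedure of
Algorithm 5. The function names are assumptions of this example.

```python
def common_intervals(perms):
    """Brute-force common intervals (first permutation is the identity)."""
    n = len(perms[0])
    pos = [{v: i for i, v in enumerate(p)} for p in perms]
    out = []
    for l in range(1, n):
        for u in range(l + 1, n + 1):
            if all(max(p[v] for v in range(l, u + 1))
                   - min(p[v] for v in range(l, u + 1)) == u - l
                   for p in pos):
                out.append((l, u))
    return out


def irreducible_by_iteration(perms):
    """Follow Algorithm 4, computing each phi_i by exhaustive search."""
    n = len(perms[0])
    current = {(j, j + 1) for j in range(1, n)}      # I of (pi_1)
    for i in range(2, len(perms) + 1):
        C_i = common_intervals(perms[:i])            # C of (pi_1, ..., pi_i)

        def phi(c):
            # smallest common interval of the first i permutations containing c;
            # (1, n) is always a candidate, so the list is never empty
            containing = [d for d in C_i if d[0] <= c[0] and c[1] <= d[1]]
            return min(containing, key=lambda d: d[1] - d[0])

        current = {phi(c) for c in current}
    return sorted(current)


example = [
    [1, 2, 3, 4, 5, 6, 7, 8, 9],
    [9, 8, 4, 5, 6, 7, 1, 2, 3],
    [1, 2, 3, 8, 7, 4, 5, 6, 9],
]
```

For the family of Example 1, the second round maps, for instance, $[8,9]$ to
$[4,9]$ and $[6,7]$ to $[4,7]$, and the final result is the set $I_\Pi$ given in
the continuation of Example 1.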



5.2 Computing $I_{\Pi_i}$ from $I_{\Pi_{i-1}}$

For the computation of $\varphi_i(I_{\Pi_{i-1}})$ we use a modified version of
Algorithm RC where the data structure $Y$ is supplemented by a data structure
$S$ that is derived from $I_{\Pi_{i-1}}$. $S$ consists of several doubly-linked
lists of intervals of ylist, one for each maximal chain of $I_{\Pi_{i-1}}$.

Using $\pi_1$ and $\pi_i$, as in Algorithm RC, the ylist of $Y$ allows, for a
given $x$, access to all non-wasteful right interval end candidates $y$ of
$C_{(\pi_1, \pi_i)}$. The aim of $S$ is to further reduce these candidates to only
those indices $y$ for which simultaneously $[x, y] \in C_{\Pi_{i-1}}$ (ensuring
$[x, y] \in C_{\Pi_i}$) and $[x, y]$ contains an interval $c \in I_{\Pi_{i-1}}$
that is not contained in any smaller interval from $C_{\Pi_i}$. Together this
ensures that exactly the irreducible intervals $[x, y] \in I_{\Pi_i}$ are
reported.

An outline of our modified version of Algorithm RC is shown in Algorithm 5.
Since the first permutation handed to the algorithm has to be the identity and
$S$ (derived from $I_{\Pi_{i-1}}$) is compatible only with the index set of
$\pi_1$, we supply the algorithm with $\mathrm{id}_n = \pi_i^{-1} \circ \pi_i$ and
$\pi_i^{-1} \circ \pi_1$ instead of $\pi_1$ ($= \mathrm{id}_n$) and $\pi_i$. (As
usual, $\pi_i^{-1}$ denotes the inverse of permutation $\pi_i$.) This does not
change the index set of the computed irreducible intervals.

In line 1 of Algorithm 5, $Y$ is initialized as in Algorithm RC. To initialize
$S$, $I_{\Pi_{i-1}}$ is partitioned into maximal chains of non-trivially
overlapping irreducible intervals as in line 1 of Algorithm 3. For each such
chain, $S$ contains a doubly-linked clist that initially holds the intervals of
that chain in left-to-right order. Moreover, intervals from different clists
with the same left end are connected by vertical pointers, yielding for each
index $x \in N$ a doubly-linked vertical list. It is not difficult to add the
vertical pointers during the construction of the clists such that the intervals
in each vertical list are ordered by increasing length (decreasing nesting
level).
Algorithm 5 (Extended Algorithm RC)

Input: Two permutations $\pi_1 = \mathrm{id}_n$ and $\pi_2 = \pi_i^{-1} \circ \pi_1$ of $N = \{1, \ldots, n\}$; $I_{\Pi_{i-1}}$ in standard notation.
Output: $I_{\Pi_i}$ in standard notation.

1: initialize $Y$ and $S$
2: for $x = n - 1, \ldots, 1$ do
3:     update $Y$ and $S$ // (see text)
4:     while ($[x', y] \leftarrow S.first\_active\_interval(x)$) defined and $f(x, y) = 0$ do
5:         output $[l(x, y), u(x, y)]$
6:         remove $[x', y]$ from its active sublist // (the interval is satisfied)
7:     end while
8: end for



To describe the update of $S$ in line 3 of the algorithm, we introduce the notion
of sleeping, active, and satisfied intervals. Initially all intervals of the
clists are sleeping. In iteration step $x$, all intervals with left end $x$
become active and are included at the head of an (initially empty) active
sublist of their clist. An interval remains active until it is satisfied or
deleted. A clist $L$ and the contained intervals are deleted whenever $x$ becomes
smaller than the left interval end of $L.head$. It might be that the right end
$y$ of an interval $[x, y]$ is already deleted from the ylist at the time of
activation. In this case, the interval is merged with the succeeding interval
$[x', y']$ in its clist, i.e. the corresponding two elements of the clist are
replaced by a new one containing the interval $[x, y']$. If no successor exists,
the interval $[x, y]$ is deleted.

Concerning the function $\varphi_i$, sleeping or active intervals correspond to
irreducible intervals from $I_{\Pi_{i-1}}$ whose images have not yet been
determined. The status changes to satisfied when the image is known.

The update of $Y$ in line 3 is the same as in Algorithm RC, the only difference
being that whenever an element $y$ is removed from ylist and $y$ is the right end
of some active or satisfied interval, this interval is merged with its successor
in its clist if such a successor exists, otherwise it is deleted. The resulting
interval inherits the active status if one of the merged intervals was active,
otherwise it is satisfied. (If both merged intervals are active, this reflects
the case that $\varphi_i$ maps both intervals of $I_{\Pi_{i-1}}$ to the same
(larger) interval of $I_{\Pi_i}$.) Note that even though $y$ can be the right end
of many irreducible intervals, at any point of the algorithm $y$ can be the right
end of at most one active or satisfied interval. This is due to the fact that no
two intervals of a maximal chain can have the same right end, and whenever two
intervals from different chains have the same right end, the chain of the
shorter interval is deleted before the longer interval is made active (cf.
Lemma 4). Hence it suffices to keep for each index $y > x$ a pointer to the
(only) active or satisfied interval with right end $y$. This right end pointer
is set when the interval is made active and is deleted when the interval's
clist is deleted.

In contrast to the simple traversal of the ylist in Algorithm RC, here the
generation of right interval end candidates in lines 4-7 is slightly more
complicated. Function $S.first\_active\_interval(x)$ returns the first active
interval $[x', y]$ in the clist of the interval at the head of the vertical list
at index $x$. If the right end $y$ of this interval gives rise to a common
interval $[l(x, y), u(x, y)]$, i.e., if $f(x, y) = 0$, a common interval of
smallest size containing an active interval is encountered. Hence we have found
an element of $I_{\Pi_i}$, which then is reported in line 5. Therefore $[x', y]$
becomes satisfied and is removed from its active sublist in line 6. In case this
interval was the last active interval of its clist, the pointer to the head of
the vertical list at index $x$ is redirected to the successor of the current
head, such that in the next iteration $S.first\_active\_interval(x)$ returns the
leftmost active interval from the clist with the next lower nesting level (if
such a list containing an interval with left end $x$ exists).

This way we only look at elements of ylist that are candidates for right ends
of minimal common intervals with left end $x$ and that contain an active
interval. $S.first\_active\_interval(x)$ generates these candidates in
left-to-right order. Since $f(x, y)$ is monotonically increasing for the elements
$y$ of ylist, and hence also for the elements of any sublist of ylist, evaluating
$f(x, y)$ until an index $y$ is encountered with $f(x, y) > 0$ finds all
irreducible intervals from $I_{\Pi_i}$ with left end $x$. This implies the
correctness of our implementation of $\varphi_i$.

The complete data structure $S$ for $\Pi = (\pi_1, \pi_2, \pi_3)$ as in Example 1
while processing index $x = 4$ of permutation $\pi_3$ is shown in Figure 2.

[Figure 2: diagram of ylist and the clists.]

Fig. 2. Sketch of ylist and the clists while processing element $x = 4$ of
$\pi_3$ for $\Pi = (\pi_1, \pi_2, \pi_3)$ as in Example 1. Shaded boxes represent
sleeping intervals, boxes with thick solid lines represent active intervals, and
boxes with thin lines represent satisfied intervals. Thick arrows connect the
elements of the active sublists, solid vertical arrows denote vertical lists
(the vertical pointer of index 5 was deleted after reporting interval [5,6] in
iteration step $x = 5$), and dotted vertical arrows are the right end pointers.



5.3 Analysis of Algorithm 5

Since all operations modifying $Y$ are the same as in Algorithm RC, this part
of the analysis carries over from the paper of Uno and Yagiura, and we can
restrict our analysis to the initialization and update of $S$.

The initialization of $S$ in line 1, including the creation of the vertical
lists, can easily be implemented in linear time in a way similar to the first
step of Algorithm 3.

In line 3, the intervals with left end x are easily found using the vertical 
lists, marked active, and prepended to the active sublists in constant time per 
interval. Since $I_{\Pi_{i-1}}$ contains O(n) intervals and since each interval is activated
exactly once, this step takes overall O(n) time. Moreover, each index y that is
deleted from the ylist can cause the merge of two intervals. Since merging two
neighbors in a doubly-linked list takes constant time, and since each of the in
total n elements of ylist is deleted at most once, this part takes overall O(n)
time as well.

As in Algorithm RC, the time required for reporting the output is proportional
to the size of the output, here $|I_{\Pi_i}| < n$. Using vertical list and active
sublist, the first active interval is found in constant time. Hence, and since the
removal of interval $[x', y]$ from the active sublist in line 6 is a constant-time
operation as well, the loop in lines 4-7 takes overall O(n) time.

Putting things together, the algorithm takes O(n) time and space. Since at
any point of the overall algorithm we need to store only two permutations $\pi_1$ and $\pi_i$ and
the current $I_{\Pi_i}$, we have

Theorem 2. The irreducible intervals of k permutations of n elements can be 
found in optimal O(kn) time and O(n) additional space.

Combining this result with Algorithm^ we get 

Corollary 1. The K common intervals of k permutations of n elements can be
found in optimal O(kn + K) time and O(n) additional space.



Abstract. We formulate the GENERALIZED PATTERN MATCHING
problem, a natural extension of string searching capturing regularities 
across scale. The special case of UNAVOIDABILITY TESTING is ob- 
tained for pure generalized patterns by fixing an appropriate family of 
text strings - the Zimin words. We investigate the complexity of this 
restricted decision problem. Although the efficiency of standard string 
searching is well-known, determining the occurrence of generalized pat- 
terns in Zimin words does not appear so tractable. We provide an expo- 
nential lower bound on any algorithmic decision procedure relying exclu- 
sively on the equivalent deletion sequence characterization of unavoid- 
able patterns. We also demonstrate that the four other known necessary 
conditions are not sufficient to decide pattern unavoidability. 



1 Introduction 

Numerous efficient algorithms have been developed to handle pattern matching 
in the exact or approximate cases when potential matchings are considered 
only between a symbol of the pattern and a symbol of the text. However, excepting
the well-known problem of finding consecutively repeated substrings,
the question of detecting string regularities across different scales has remained 
largely unaddressed. Generalized patterns capture this wider focus by expand- 
ing the target of some potential matches beyond individual symbols of the text 
to include all nonempty substrings. Furthermore, the complexity questions aris- 
ing with this generalization represent an intriguing departure from the known 
polynomial time algorithms in the standard cases. 

Pure GENERALIZED PATTERN MATCHING (GPM), as alluded to by
Cassaigne and explicitly defined in section 2, is clearly in NP. A given
correspondence between a generalized pattern consisting purely of variables and 
a substring of the text can be efficiently verified by standard string searching 
techniques. However, the current pattern matching methods appear unlikely to 
yield an efficient deterministic solution to the pure GPM problem. The success of 
such sequential algorithms, fundamentally dependent on preprocessing of pattern,
text, or associated structure, does not obviously generalize. In fact, the
results of this paper are intended to suggest an intrinsic computational difficulty
at the heart of pure GPM. 

We focus on investigating the complexity of UNAVOIDABILITY TESTING
(UT). As outlined in section 3, the link between string unavoidability and generalized
pattern matching follows from the decidability results of Zimin. As a
special case of GPM, the problem instance depends solely on the pattern, with
the associated text string chosen from the fixed family of Zimin words. Further
results of Zimin and Bean et al. yield a method for constructing a matching
correspondence which exists if and only if the pattern is unavoidable. The
method depends on the existence of a sequence of deletions reducing a pattern 
to the empty string. We show that, interpreted as the appropriate algorithm, 
this reductive deletion method cannot decide pattern unavoidability without en- 
countering an inescapable exponential initial branching of the computation tree. 

To prove such an exponential lower bound on the deletion approach, we begin
in section 4 by explicitly defining the bipartite graph induced by the deletion
criteria. Combinatorial analysis of the connected components yields theorem 3.
This significant result permits a further restriction of the problem to patterns
having a unique first step in the deletion sequence. The difficulty of determin- 
istically locating the correct initial deletion forms the basis of our complexity 
conclusions; we show that the unique choice must be made from among expo- 
nentially many valid possibilities. 

More precisely, section 5 introduces a notion of size which reflects, to a degree,
the number of choices a deletion reduction algorithm might face. With the
appropriate idea of related patterns, we prove theorem 4, a fundamental result
about the possibility of creating specific combinatorial combinations of certain
types of patterns. Applying theorem 4 and its immediate consequences, beginning
with experimentally located base cases, yields a progression of patterns
with increasing difficulty in determining the correct deletion. 

Section 6 fully addresses the complexity of UT under the deletion criteria
algorithmic approach, by formulating the results in terms of the input pattern 
length and confirming the exponential distribution among valid deletion choices. 
Given these results, we conclude that an algorithm employing the deletion se- 
quence criteria cannot determine the unavoidability of arbitrarily many patterns 
without considering an exponential number of possibilities. This inescapable ex- 
ponential initial branching of the computational tree implies that UNAVOID- 
ABILITY TESTING can not be solved in polynomial time by such a deletion 
algorithm. 

Finally, section 7 addresses some alternate algorithmic approaches, based on
the four other known necessary conditions. An optimal extension of a combination
of these four criteria can be interpreted as a binary tree structure. However,
theorem 10 shows that there exist arbitrarily many avoidable patterns also encoded
by this tree structure and satisfying the necessary conditions. Hence, the
conditions are not sufficient to produce a polynomial time algorithm deciding
pattern unavoidability.



2 Generalized Pattern Matching



The collection of all possible non-empty strings over a finite (and non-empty)
set of symbols $S$ will be denoted $S^+$. An overbar distinguishes elements of $S^+$
from individual string symbols, $\epsilon$ denotes the empty string, and $S^+ \cup \{\epsilon\} = S^*$.
Note that $S^*$ is the set of reduced strings, since $s\epsilon = \epsilon s = s$ for all $s \in S$ and
by extension for all $t \in S^*$. Finally, $\le$ will refer to the substring relation on
$S^+ \times S^+$, that is $t \le s$ if there exist strings $u, v \in S^*$ so that $utv = s$.

Although pattern matching is typically phrased as a constructive problem, 
we are primarily interested in complexity issues. Hence, the decision problem 
associated with exact pattern matching can be stated: 

Problem: PATTERN MATCHING (PM) 

Instance: $p \in A^+$ and $w \in A^+$

Question: Is $p \le w$?

As remarked in section 1, we wish to expand the concept of matching patterns
to capture string regularities across scales. PM can be viewed as asking whether
there exists an identity mapping between a pattern p and a subword of w. We
broaden this notion of a matching correspondence by introducing a new set V of
“variable” pattern symbols, which may map to non-empty words in $A^+$, while
the symbols in A remain “constant.”

Definition 1 (Generalized Pattern Occurrence). For $p \in (V \cup A)^+$ and
$w \in A^+$, $p \mid w$ if there exists a map $\phi : (V \cup A) \to A^+$ such that $\phi(a) = a$ for
all $a \in A$ and $\phi(p) \le w$ under the induced homomorphism.

Henceforth, any pattern or its occurrence should be considered in the context 
defined above. Note that $\phi$ is a non-erasing homomorphism, restricted to the
identity on A. As such, PM is the special case of GPM with only constant
pattern symbols. “Pure” GPM will refer to the case when $p \in V^+$, while a
pattern that includes both variables and constants will be called “mixed” GPM. 

Problem: GENERALIZED PATTERN MATCHING (GPM)

Instance: $p \in (V \cup A)^+$ and $w \in A^+$

Question: Does $p \mid w$?
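
As a concrete reading of Definition 1, the following Python sketch searches for a witnessing map by plain backtracking. The function name `find_occurrence`, the use of Python strings for patterns and texts, and the separate `variables` set are assumptions made for illustration and are not notation from the paper. The search is exponential in the worst case and is meant only to make the definition executable, not to suggest an efficient algorithm.

```python
from typing import Dict, Optional, Set


def find_occurrence(p: str, w: str, variables: Set[str]) -> Optional[Dict[str, str]]:
    """Return a map phi (variable -> non-empty string) such that phi(p) is a
    substring of w, or None if no such map exists. Symbols of p that are not
    in `variables` are constants and must match themselves."""

    def extend(i: int, pos: int, phi: Dict[str, str]) -> Optional[Dict[str, str]]:
        if i == len(p):                      # whole pattern consumed
            return dict(phi)
        sym = p[i]
        if sym not in variables:             # constant symbol
            if w.startswith(sym, pos):
                return extend(i + 1, pos + 1, phi)
            return None
        if sym in phi:                       # variable already bound
            img = phi[sym]
            if w.startswith(img, pos):
                return extend(i + 1, pos + len(img), phi)
            return None
        for end in range(pos + 1, len(w) + 1):   # try every non-empty image
            phi[sym] = w[pos:end]
            found = extend(i + 1, end, phi)
            if found is not None:
                return found
            del phi[sym]
        return None

    for start in range(len(w)):              # every possible starting position
        found = extend(0, start, {})
        if found is not None:
            return found
    return None
```

For instance, `find_occurrence("xx", "abcabc", {"x"})` binds `x` to `"abc"`, while the same pattern has no occurrence in `"abc"`.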

The rest of this paper considers the complexity of UNAVOIDABILITY TEST- 
ING, a special case of pure GPM depending only on the given pattern. 



3 Unavoidability Testing 

Definition 2 (Unavoidable). A string s is called unavoidable if every
infinite word on n letters has an occurrence of s as a pure generalized pattern.

The complementary characteristic of avoidability, often phrased in terms of
sets of finite words, has been the primary research focus of this area. (A summary 
of current results can be found in Q.) The above definition, itself a statement 
about asymptotic pure GPM, is preferred for the complexity questions of this pa- 
per. Decidability results on unavoidable patterns utilize the Zimin words, which 
will be chosen as our family of text words, indexed by the number of distinct 
symbols appearing in each string.

Definition 3 (Zimin Words). Let $Z_1 = a_1$ and recursively define $Z_n$ on
$A_n = \{a_1, a_2, \ldots, a_n\}$ as $Z_n = Z_{n-1} a_n Z_{n-1}$. Equivalently, define a mapping
$\theta : A_{n-1} \to A_n^+$ where $\theta(a_i) = a_{i+1} a_1$. Then $Z_n = a_1 \theta(Z_{n-1})$.
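
The two constructions in Definition 3 can be compared directly. The sketch below represents the letter $a_i$ by the integer $i$ and reads the morphism as $\theta(a_i) = a_{i+1} a_1$, which is the reading under which $a_1 \theta(Z_{n-1})$ reproduces the recursive definition; the function names are illustrative assumptions.

```python
from typing import List


def zimin(n: int) -> List[int]:
    """Z_n as a list of letter indices, via Z_n = Z_{n-1} a_n Z_{n-1}."""
    word = [1]
    for k in range(2, n + 1):
        word = word + [k] + word
    return word


def zimin_by_morphism(n: int) -> List[int]:
    """Z_n via Z_n = a_1 theta(Z_{n-1}), assuming theta(a_i) = a_{i+1} a_1."""
    word = [1]
    for _ in range(2, n + 1):
        image: List[int] = []
        for i in word:
            image += [i + 1, 1]     # theta(a_i) = a_{i+1} a_1
        word = [1] + image
    return word
```

Both functions produce words of length $2^n - 1$, which is the source of the length bound used later as a necessary condition for unavoidability.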

For any string s, let $\alpha(s)$ be the number of distinct symbols occurring in s.
Also, for $p \in V^+$, it will be assumed that $|V| = \alpha(p)$.

Theorem 1. $p$ is an unavoidable pattern if and only if $p$ occurs in $Z_{\alpha(p)}$.

Thus, theorem 1 states that the problem of determining pattern unavoidability
is a special case of pure GPM depending only on the pattern with the
associated Zimin word as text.



Problem: UNAVOIDABILITY TESTING (UT) 

Instance: $p \in V^+$
Question: Does $p \mid Z_{\alpha(p)}$?

One direction in theorem 1 follows immediately from the unavoidability of
Zimin words themselves and the transitivity of generalized pattern occurrence.
However, the other direction depends on an equivalent characterization of unavoidable
patterns in terms of the $\sigma$-deletion of free sets. Although the two
characterizations are polynomially equivalent, $\sigma$-deletions and free sets offer a
decided advantage in terms of algorithmic analysis.

Definition 4 (Free Set). $F \subseteq V$ is free for $p \in V^+$ if and only if there
exist sets $A, B \subseteq V$ such that $F \subseteq B \setminus A$ where, for all $xy \le p$, $x \in A$ if and
only if $y \in B$.



The requirement on all substrings xy is called the “two-window” criteria. 
Note that the definition could equivalently require $F \subseteq A \setminus B$.

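Definition 5 ($\sigma$-Deletion). The mapping $\sigma_F$ is a $\sigma$-deletion of $p \in V^+$
if and only if $F \subseteq V$ is a free set for $p$ and $\sigma_F : V \to V \cup \{\epsilon\}$ is defined by

$$\sigma_F(x) = \begin{cases} x & \text{if } x \notin F \\ \epsilon & \text{if } x \in F \end{cases}$$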


Footnote: The decision results, published independently in the cited works, use different terminology
and notation. For the purposes of this paper, whichever seemed the most appropriate
was chosen.

Note that $\sigma_F(p)$ always refers to the reduced string in $V^*$. Moreover, a matching
correspondence lifts through a $\sigma$-deletion. Intuitively, the $\sigma$-deletion criteria
permits variables to be “squeezed” into the mapping as needed. Hence, there is 
also the following unavoidability characterization. 

Theorem 2. $p$ is an unavoidable pattern if and only if $p$ can be reduced
to $\epsilon$ by a sequence of $\sigma$-deletions.

Since $\sigma$-deletions are designed specifically to permit a “bottom-up” construction
of the matching correspondence, there appears to be an immediate algorithmic
advantage to this method of resolving UNAVOIDABILITY TESTING.
Rather than searching through all possible matching correspondences, recursively
generate a complete $\sigma$-deletion sequence, which is easily converted into
the desired mapping. Moreover, the $\sigma$-deletion characterization is the more
combinatorially accessible, as the next section demonstrates.
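
Theorem 2 can be turned into a naive decision procedure, sketched below in Python under stated assumptions. The helper `is_free` tests a candidate set by growing the smallest sets A and B forced by the two-window criteria once $F \subseteq B$ is required, and then checking that no variable of F was forced into A. The function `is_unavoidable` simply tries every free set at every step. Both names are hypothetical, patterns are plain strings with one character per variable, and the exhaustive search is deliberately exponential: it illustrates the branching that the paper analyzes rather than avoiding it.

```python
from functools import lru_cache
from itertools import combinations
from typing import FrozenSet


def is_free(F: FrozenSet[str], p: str) -> bool:
    """Two-window test: build the minimal A, B with F inside B such that
    for every adjacent pair xy in p, x is in A iff y is in B; F is free
    exactly when it stays disjoint from A."""
    pairs = {(p[i], p[i + 1]) for i in range(len(p) - 1)}
    A, B = set(), set(F)
    changed = True
    while changed:
        changed = False
        for x, y in pairs:
            if y in B and x not in A:
                A.add(x)
                changed = True
            if x in A and y not in B:
                B.add(y)
                changed = True
    return not (F & A)


@lru_cache(maxsize=None)
def is_unavoidable(p: str) -> bool:
    """True iff p can be reduced to the empty string by sigma-deletions."""
    if p == "":
        return True
    variables = sorted(set(p))
    for size in range(1, len(variables) + 1):
        for F in map(frozenset, combinations(variables, size)):
            if is_free(F, p):
                reduced = "".join(c for c in p if c not in F)
                if is_unavoidable(reduced):
                    return True
    return False
```

On small inputs this agrees with familiar facts: `xyx` and the Zimin word `xyxzxyx` reduce to the empty string, whereas the squares `xx` and `xyxy` admit no complete deletion sequence.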

4 Minimality and Uniqueness 

A $\sigma$-deletion requires a free set $F \subseteq B \setminus A$, where A and B satisfy the two-window
criteria. The minimal such sets can be simultaneously constructed by
considering a pattern’s adjacency graph.

Definition 6 (Minimal $\sigma$-Graph). Let $G(p)$ be the bipartite graph with vertices
$[a, x]$ and $[x, b]$ for every variable $x$ in $p$ and $a, b \notin V$. $([a, x], [y, b])$ is an
edge in $G(p)$ if and only if $xy \le p$.

For each connected component of $G(p)$, the projections of the left and right
sides back down onto subsets of V yield sets A and B minimally satisfying the
two-window criteria for p. A trivial $G(p)$ has only one connected component,
and no possible $\sigma$-deletions. However, there certainly may be more than two
pairs of minimal A and B sets. Furthermore, the necessity of considering unions
of these minimal sets has not yet been ruled out.
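
The projection of connected components onto pairs (A, B) can be sketched with a small union-find, as below. The vertex encoding `('L', x)` for $[a, x]$ and `('R', y)` for $[y, b]$, and the function name `minimal_pairs`, are assumptions chosen for this illustration.

```python
from typing import Dict, List, Tuple


def minimal_pairs(p: str) -> List[Tuple[List[str], List[str]]]:
    """Connected components of the adjacency graph of p, each projected to
    a pair (A, B): A collects variables on the left side, B on the right."""
    parent: Dict[Tuple[str, str], Tuple[str, str]] = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    for x in set(p):                        # one left and one right vertex per variable
        find(("L", x))
        find(("R", x))
    for i in range(len(p) - 1):             # edge for every adjacent pair xy
        root_left = find(("L", p[i]))
        root_right = find(("R", p[i + 1]))
        parent[root_left] = root_right

    comps: Dict[Tuple[str, str], Tuple[set, set]] = {}
    for v in list(parent):
        side, x = v
        A, B = comps.setdefault(find(v), (set(), set()))
        (A if side == "L" else B).add(x)
    return sorted((sorted(A), sorted(B)) for A, B in comps.values())
```

For the pattern `xyxzuzyvyx` this yields two components, one of which has A = {y} and B = {x, v}, so that B \ A = {x, v}; a square such as `xx` yields a single component, the trivial case.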

Call a $\sigma$-deletion and its associated free set F worthwhile if $\sigma_F(p)$ is unavoidable,
that is if a $\sigma$-deletion sequence beginning with $\sigma_F$ reduces p to the
empty string. Also, let $\equiv$ be the equivalence relation on the vertices of $G(p)$
where two vertices are equivalent if and only if they are in the same connected
component.

Lemma 1. If p is unavoidable, then there exists a worthwhile $\sigma$-deletion $\sigma_F$ of
p such that, for all $y_1, y_2 \in F$, $[y_1, b] \equiv [y_2, b]$ in $G(p)$.

The result follows from a proof that any worthwhile $\sigma$-deletion of p can be
separated into free sets of equivalent variables, which may then be $\sigma$-deleted in
sequence. A detailed justification of it, and all subsequent results, is given in Q. 
Additionally, the symmetry possible in the free set definition, where $B \setminus A$ was
arbitrarily chosen over $A \setminus B$, should not be overlooked. Either by considering the
implications of this dual definition or by investigating the relationship between
the $\sigma$-deletions of a pattern $p$ and its reverse $p^R$, the following result is obtained.

Corollary 1. If p is unavoidable, then there exists a worthwhile $\sigma$-deletion $\sigma_F$
of p such that, for all $y_1, y_2 \in F$, $[a, y_1] \equiv [a, y_2]$ in $G(p)$.

Hence, $G(p)$ is justifiably considered the minimal $\sigma$-graph of p; unavoidability
can be determined by considering only the free sets arising from the connected
components of $G(p)$. Although an unavoidable pattern must have such a minimal
$\sigma$-deletion, it is by no means unique.

Theorem 3. If p is unavoidable and $G(p)$ has at least three connected components,
then there exists more than one minimal worthwhile $\sigma$-deletion of p.

The proof rests on showing that, under the given conditions, the order of the
first two minimal $\sigma$-deletions can be reversed. For this paper, the result’s primary
application is in restricting to unavoidable patterns with a unique worthwhile
$\sigma$-deletion, ensuring that $G(p)$ has exactly two connected components.



5 Unavoidable Combinations 

A pattern with a unique worthwhile $\sigma$-deletion has exactly two pairs of sets
minimally satisfying the two-window criteria, which shall be denoted $A$, $B$, and
$A^c$, $B^c$, with $F \subseteq B \setminus A$. The size of a unique $\sigma$-deletion is introduced as a
measure of the difficulty in choosing the corresponding free set.

Definition 7 ($\sigma$-Deletion Size). Suppose p has a unique worthwhile $\sigma$-deletion
$\sigma_F$. Let $|\sigma_F| = k_1/k_2$ where $k_1 = |F|$ and $k_2 = |B \setminus A|$.

Implicit in the notation is that $|\sigma_F|$ is measured with respect to a particular
pattern p and its minimal $\sigma$-graph $G(p)$. In terms of the most general bounds,
clearly $1 \le k_1 \le k_2$. See theorem 7 for a statement of the precise relationship
among possible $k_1$, $k_2$, and $\alpha(p)$.

The difficulty of deciding pattern unavoidability on the basis of the $\sigma$-deletion
criteria will rest on demonstrating two facts. The first is that $k_2$ can increase
exponentially as a function of $|V|$ without necessitating an exponential increase
in the lengths of the patterns involved. Secondly, for many such $k_2$, there exist
patterns where $k_1$ can take on all possible values between 1 and $k_2$.

These results are achieved by looking at specific combinatorial combinations
of unavoidable patterns. As motivation, recall the recursive Zimin word definition,
noting that every $Z_n$ has a unique $\sigma$-deletion of size 1/1. It is necessary
to generalize only slightly this method of combining unavoidable patterns
“buffered” by newly introduced variables by defining an appropriate restriction
on equivalence classes of isomorphic patterns.

Definition 8 ($\sigma$-Isomorphic). Suppose $p \in V^+$ and $q \in U^+$. Say $q = p'$ if
there exists a one-to-one mapping $\phi : V \to U$ such that $\phi(x) = x$ for $x \in V \cap U$
and under the induced isomorphism $\phi(p) = q$.

The prime notation is carried through structures related under $\sigma$-isomorphism.
Although pattern unavoidability is preserved under unrestricted isomorphisms,
$\sigma$-isomorphisms constrain the possible relabeling of variables so that
unavoidability is preserved under the Zimin-type construction, $pzp'$. Moreover,
the following fundamental result states that, for specific kinds of unavoidable
patterns and certain of their $\sigma$-isomorphisms, the construction yields new patterns
with unique worthwhile $\sigma$-deletions.

Theorem 4. Let $p \in V^+$ with $\alpha(p) \ge 3$. Assume p has a unique worthwhile $\sigma$-deletion
with associated sets $A$, $B$, $A^c$, $B^c$. Suppose further that $p = sut$ and that
either $t \notin A$ or $s \notin B^c$. Consider a $\sigma$-isomorphism $p'$ with $(V \setminus V') \subseteq (B \setminus A)$.
Then, for $z \notin V \cup V'$, $pzp'$ has a unique worthwhile $\sigma$-deletion as well.

The proof proceeds in several stages, beginning with consideration of $G(pzp')$.
The conditions of the theorem are sufficient to show, by exhausting all other
possibilities, that the union of the original free set and its $\sigma$-isomorphic image form
the only worthwhile $\sigma$-deletion. According to those conditions, only variables of
$B \setminus A$ may differ between $p$ and $p'$. However, any subset of A may be relabeled
by such a $\sigma$-isomorphism, yielding an immediate quantitative corollary.

Corollary 2. Suppose p satisfies the requirements of theorem 4 and $|\sigma_F| = k_1/k_2$.
Then, for $0 \le i \le k_1$ and for $0 \le j \le k_2 - k_1$, there exists an unavoidable
pattern q with $\alpha(q) = \alpha(p) + 1 + i + j$ such that the unique $\sigma$-deletion of q has
size $(k_1 + i)/(k_2 + i + j)$.

Beyond individual patterns, theorem 4 can be inductively applied to yield a
progression of non-empty equivalence classes of patterns.

Definition 9. Let $(k_1/k_2)_n$ be the set of all unavoidable patterns having n distinct
variables and unique $\sigma$-deletions of size $k_1/k_2$.

Theorem 5. Given $p \in (k_1/k_2)_n$ with p satisfying the conditions of theorem 4,
then for $0 \le l_1 \le (2^i - 1)k_1$ and $0 \le l_2 \le (2^i - 1)(k_2 - k_1)$ with $i > 0$,

$$\bigl((k_1 + l_1)/(k_2 + l_1 + l_2)\bigr)_{n+i+l_1+l_2} \ne \emptyset.$$

Of course, these results are contingent on the initial existence of specific
patterns, which has not yet been provided. However, when $|V| = 3$, 4, and 5,
it is practically feasible to enumerate all nonisomorphic patterns and decide
their unavoidability. Tables in the Appendix summarize the numerical results of
such enumeration and decision programs, with the conclusion that the required
patterns do indeed exist.

Theorem 6. There exist patterns p satisfying theorem 4 for $\alpha(p) = 3, 4, 5$.

Hence, progressions of collections of pattern classes $(k_1/k_2)_n$ are known to
exist, where $k_2$ increases linearly with n and $k_1$ takes on entire ranges of values.
In fact, using results and techniques well beyond the scope of this paper, there are
even worse implications for the algorithmic complexity of deciding unavoidability
by a $\sigma$-deletion reduction. The constructive proof techniques do not cover the
exceptional cases, whose emptiness has been exhaustively verified.

Theorem 7. With the exception of $(2/2)_4$ and $(4/4)_7$, for $1 \le k_1 \le k_2 \le n - 2$,
{ki/k 2 )n if and only if 1 + 2* > fci. 

6 Computational Complexity 

Thus far, the difficulty of constructing a unique worthwhile $\sigma$-deletion has been
expressed in terms of the ratio of $|F|$ to $|B \setminus A|$. The previous results have demonstrated
that, with increasing numbers of variables, ever greater ranges of free set
cardinalities must be considered. However, the complexity of an algorithm solving
UNAVOIDABILITY TESTING is measured with respect to its input size,
the length of a given pattern. The following theorem confirms the existence of
patterns p with $|p| = O(\alpha(p))$ and unique worthwhile free sets with sizes ranging
from 1 to $O(\alpha(p))$.

Theorem 8. Suppose $i > 0$ and $k_2 = 2^i + 1$. For all $1 \le k_1 \le k_2$, there exist
$p \in (k_1/k_2)_{2^{i+1}+4}$ such that $|p| \le 2^{i+4} - 1$.

Although the constructive proof depends on an inductive application of theorem 4,
most patterns satisfying the length bound of $|p| < 16 \cdot \alpha(p)$ in theorem 8
will not be of the form $pzp'$. Furthermore, $k_2 > |p|/16$ so that with longer patterns
(and a greater number of variables) the size of $B \setminus A$ increases arbitrarily
while the unique worthwhile free set may be of any size from $1, \ldots, k_2$. However,
because this growth has been shown only for set cardinalities, demonstrating a
truly exponential branching of the $\sigma$-deletion decision tree requires considering
the actual distribution of free sets among unavoidable patterns.

Consider these patterns satisfying the conditions of theorem 4 with $\alpha(p) = 5$,
$|p| < 15$, and $B \setminus A = \{x, v\}$.

xyxzuzyvyx and F = {x}

xyzxuzvuv and F = {v}

xyxzuvyvzy and F = {x, v}

Replacing either x or v with t and combining $pzp'$ under theorem 4 yields patterns
with $B \setminus A = \{x, v, t\}$ where F may be any one of {x}, {v}, {x, t}, {v, t},
{x, v, t}. Generalizing this technique demonstrates that there exist patterns p
with the same associated $B \setminus A$ sets where the unique choice for a worthwhile
$\sigma$-deletion may be any one of an exponential, in $|p|/16$, number of free sets F.
Hence, the criteria $F \subseteq B \setminus A$ is genuine; a deterministic algorithm attempting
to decide unavoidability on the basis of the $\sigma$-deletion criteria faces patterns
with exponentially many valid choices for the initial free set. Thus there is an
exponential lower bound on deciding UT by means of $\sigma$-deletions. While this
does not rule out other better algorithms, further results show that the other
known necessary conditions are not sufficient.



7 Insufficiency

The following four necessary conditions can be extracted from the definition 
and decision characterization of unavoidable patterns. Note that the last three 
criteria are clearly verifiable in polynomial time. 

1. $q$ is unavoidable for any $q \le p$,

2. $|p| \le 2^{\alpha(p)} - 1$,

3. $p$ has an isolated variable,

4. $G(p)$ is nontrivial.

The first condition follows immediately from the definition of string avoidability,
the second from the characterization that $p \mid Z_{\alpha(p)}$, while the third comes
most clearly from the $\sigma$-deletion reduction of p to a single variable, and the fourth
from the $\sigma$-deletion definition. These conditions strongly suggest a “divide-and-conquer”
heuristic for an efficient decision algorithm. Having located the
necessary isolated variable, the pattern may be broken into two substrings. Each
of these must also have an isolated variable, if nonempty and likewise unavoidable,
and so can be further subdivided. Two different algorithms are possible,
depending on whether the second branching occurs at the same variable in the
right and left substrings or not. It can easily be seen that requiring the subdivision
to occur at the same variable would lead to false negatives, while unconstrained
division ultimately leads to arbitrarily many false positives for reasons
similar to those outlined below.
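
The two cheapest conditions and the first splitting step of the heuristic can be sketched as follows. This assumes condition 2 is the bound $|p| \le 2^{\alpha(p)} - 1$ (the length of $Z_{\alpha(p)}$) and takes an isolated variable to be one occurring exactly once in the pattern; the function names are illustrative only, and conditions 1 and 4 are not checked here.

```python
from collections import Counter
from typing import List, Tuple


def satisfies_length_bound(p: str) -> bool:
    """Condition 2 (assumed form): |p| <= 2**alpha(p) - 1."""
    return len(p) <= 2 ** len(set(p)) - 1


def isolated_variables(p: str) -> List[str]:
    """Condition 3: variables occurring exactly once in p."""
    counts = Counter(p)
    return [x for x in sorted(counts) if counts[x] == 1]


def split_at(p: str, x: str) -> Tuple[str, str]:
    """Break p into the substrings left and right of the single occurrence of x."""
    left, _, right = p.partition(x)
    return left, right
```

Splitting `xyxzxyx` at its isolated variable `z` gives the two halves `xyx` and `xyx`, each of which again has an isolated variable, which is the recursion the heuristic relies on.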

A third and optimal algorithmic approach would require that substrings be
divided at the same variable, but permitted to remain intact if this is not possible,
so long as at least one division does occur for each recursion. This algorithm can
be embodied by a rooted binary tree structure, with a level for each distinct
variable. In addition to the nodes labeled by variables, each of which has two
children, there are $\epsilon$ nodes which can be the parent of only one node and hence
are considered “placeholders” for an eventual variable node. Let $\mathcal{T}(V)$ be the
collection of all such trees for some variable set V. For $T \in \mathcal{T}(V)$, let $\pi(T)$ be the
pattern which can be reconstructed from T by reversing the splitting algorithm.
Note that the number of nodes of $T \in \mathcal{T}(V)$ cannot exceed $2^{|V|} - 1$, so $\pi(T)$
will automatically satisfy the length bound of condition 2.

Theorem 9. For unavoidable $p \in V^+$, there exists $T \in \mathcal{T}(V)$ with $\pi(T) = p$.

Since $\mathcal{T}(V)$ were motivated by combining and extending conditions
necessary for unavoidability, the result is to be expected. However, the proof
depends on an inductive construction using a $\sigma$-deletion sequence for p, and
so implies the difficulty of uniquely associating a $T \in \mathcal{T}(V)$ with a given p. If
the converse of theorem 9 were also true, then pattern unavoidability would be
decided efficiently by the recursive algorithm under consideration. And, in conjunction
with verifying the nontriviality of $G(p)$, the correspondence between T
and unavoidable p is exact (although non-unique) when $|V| = 3$ and 4. However,
the situation deteriorates rapidly after this.

When $|V| = 5$, there are now known to be 55,572 nonisomorphic unavoidable
patterns. But there are a grand total of 2,384,331 other nonisomorphic p
with $p = \pi(T)$ for some $T \in \mathcal{T}(V)$. Of those, 60,702 have nontrivial $G(p)$ of
which 3720 have no avoidable proper subpatterns. Thus, there exist thousands
of avoidable patterns with five variables which satisfy the four conditions listed
above. Moreover, given even one, it is possible to generate arbitrarily more.

Theorem 10. Let $q \in V^+$ be an avoidable pattern. Suppose that

1. there exists $T \in \mathcal{T}(V)$ with $\pi(T) = q$,

2. q has nontrivial $G(q)$,

3. and all proper subpatterns of q are unavoidable.

Then there exist infinitely many r such that r also satisfy the three conditions
above and r is not unavoidable either.

One of the implications of the result is that $q \not\le r$, which prevents the potential
decision algorithm from being adjusted to take finitely many such q into
consideration. It also indicates that the Zimin-type construction, which was used
so successfully in the results of section 5, will not be the basis for the proof of
this theorem. Instead, the concept of “expanding” selected nodes of the tree
structure will be exploited by replacing the instances of a variable by another
pattern.

Definition 10. Let $p \in V^+$ and $q \in U^+$. Suppose that y is a variable of q and
that $(U \setminus \{y\}) \cap V = \emptyset$. Define $\rho(q, y, p)$ to be the pattern obtained by replacing
every instance of y in q by p.

The restriction that $(U \setminus \{y\}) \cap V = \emptyset$ optimally ensures the consistency of
the associated tree structure.
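
Definition 10 amounts to a guarded string substitution, sketched below. The function name `substitute` is a hypothetical stand-in for the map in the definition, and the sketch takes U and V to be exactly the variables occurring in q and p, which is an assumption.

```python
def substitute(q: str, y: str, p: str) -> str:
    """Replace every instance of the variable y in q by the pattern p,
    requiring that the remaining variables of q do not occur in p."""
    if y not in q:
        raise ValueError("y must be a variable of q")
    if (set(q) - {y}) & set(p):
        raise ValueError("variables of q other than y must not occur in p")
    return q.replace(y, p)
```

For example, `substitute("xyx", "y", "aba")` gives `xabax`, while substituting `xz` for `y` in `xyx` is rejected because `x` would then be shared between the two patterns.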

Theorem 11. $\rho(q, y, p)$ is unavoidable if and only if both q and p are.

The proof in the more complicated direction depends on showing that the
$\sigma$-deletion sequence for p can be inserted appropriately into the $\sigma$-deletion sequence
for q to produce one which reduces $\rho(q, y, p)$ to the empty string. Hence,
the proof of theorem 10 follows almost immediately from theorem 11. Note that
although q does exist as a subsequence of $r = \rho(q, y, p)$, it can be shown that
only subsequences generated by $\sigma$-deletions are useful in deciding pattern
unavoidability.

Consequently, the other known necessary conditions are not sufficient for
deciding pattern unavoidability, while the $\sigma$-deletion decision procedure cannot
be effected in subexponential time. Additionally, there are no clearly easier means
of determining whether $p \mid Z_{\alpha(p)}$. Hence, this special case of GPM is known to
be in NP and, in the absence of a more efficient characterization, is not expected
to succumb to any polynomial time algorithm.



A Appendix



Table 1. Decomposition of $(k_1/k_2)_n$ sets according to the four possibilities for
$p = sut$ relevant to theorem 4. Only the second column is excluded under the
theorem’s assumptions. The symmetry between the first and fourth cases is due
to the dual nature of the $\sigma$-deletion definition.

$(k_1/k_2)_n$   $t \in A, s \in B$   $t \in A, s \in B^c$   $t \in A^c, s \in B$   $t \in A^c, s \in B^c$
(1/1)_3                2                  0                     4                     2
(1/1)_4               63                 34                   116                    63
(1/2)_4                9                  0                    20                     9
(1/1)_5             7571               5072                12,156                  7571
(1/2)_5             2558               1432                  4544                  2558
(1/3)_5              117                  0                   362                   117
(2/2)_5              153                 41                   411                   153
(2/3)_5                6                  0                    29                     6

Table 2. The complete enumeration of nonisomorphic unavoidable patterns with
3, 4, and 5 distinct variables, respectively. Each group of results is broken down
by pattern length, and then further refined by considering only those patterns
with a unique $\sigma$-deletion of size $k_1/k_2$.



Abstract. We consider the problem of developing an efficient tree-based
data structure for storing and searching large data sets. It is assumed 
that the data set is stored in secondary storage, and hence the goal is 
to reduce the number of accesses to the storage. The number of accesses 
is measured by the number of edges of the tree that get accessed while 
processing a query. 

The data consists of a large number of words, each word being a string of 
characters. Each query consists of searching for a given word (member). 
Our goal is to design a simple data structure that permits efficient ex- 
ecution of the queries. The special case when the words are the suffixes 
of a string is the classic suffix tree data structure. 

We discuss different approaches for reducing the depth of a suffix tree.



Abstract. We present a fast algorithm for optimal alignment between
two similar ordered trees with node labels. Let S and T be two such
trees with |S| and |T| nodes, respectively. An optimal alignment between
S and T which uses at most d blank symbols can be constructed in 
0{nlogn ■ (maxdeg)* ■ d?) time, where n = maxUSI, |T|} and maxdeg 
is the maximum degree of a node in S or T. In particular, if the input 
trees are of bounded degree, the running time is 0(n log n ■ d^). 



1 Introduction 



Let R be a rooted tree. R is called a labeled tree if each node of R is labeled by
a symbol from a fixed finite set $\Sigma$. R is an ordered tree if the left-to-right order
among siblings in R is given.

The problem of determining the similarity between two labeled trees occurs
in several different areas of computer science. For example, in computational
biology, methods for measuring the similarity between ordered labeled trees
of bounded degree can be used in the comparison of RNA secondary
structures. The problem also occurs in the comparison of evolutionary trees,
organic chemistry, pattern recognition, and image clustering.

The similarity between two labeled trees can be defined in various ways,
analogous to the ways of defining the similarity between two sequences.
For example, one can look for the maximum agreement subtree, the
largest common subgraph, the smallest common supertree, the minimum tree
edit distance, etc.

Jiang et al. generalized the concept of an alignment between sequences
to include labeled trees as follows. An insert operation on a labeled tree adds a
new node u which is labeled by a blank symbol λ (space) not belonging to Σ.
The operation either (1) turns the current root of the tree into a child of u and
lets u become the new root, or (2) makes u the parent of a subset of (if the tree
is unordered) or a consecutive subsequence of (if the tree is ordered) the children
of an existing node v, and u a child of v. An alignment between two labeled trees
is obtained by performing insert operations on the two trees so they become
isomorphic when labels are ignored, and then overlaying the first augmented
tree on the other one. The score of the alignment is the sum of the scores of
all matched pairs of labels, where the score of a pair of labels is defined by a
given function μ : (Σ ∪ {λ}) × (Σ ∪ {λ}) → Z. An optimal alignment between a
pair of labeled trees is an alignment between them achieving the highest possible
score.¹ See Fig. 1 for an example.




Fig. 1. Example: Let Σ = {a, b, c, d, e} and define the scoring function μ as
μ(x, x) = 3, μ(x, y) = −1, μ(x, λ) = μ(λ, x) = −2, μ(λ, λ) = −2 for all
x, y ∈ Σ with x ≠ y. Then the score of the alignment in (c) of the two ordered
trees shown in (a) and (b) is equal to 2.
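
As a small illustration of how an alignment is scored, the following sketch sums the scores of matched label pairs using the scoring function of the Fig. 1 example. The representation of the blank symbol as `BLANK`, the function names, and the list of matched pairs are assumptions made for this illustration; the pairs are not the alignment drawn in the figure.

```python
BLANK = None  # assumed stand-in for the blank symbol (space)


def mu(x, y):
    """Scoring function of the Fig. 1 example (blank encoded as BLANK)."""
    if x is BLANK or y is BLANK:
        return -2          # any pair that involves a blank symbol
    return 3 if x == y else -1


def alignment_score(matched_pairs):
    """Score of an alignment = sum of the scores of all matched label pairs."""
    return sum(mu(x, y) for x, y in matched_pairs)


# Hypothetical matched pairs: one match, one mismatch, two blanks.
pairs = [("a", "a"), ("b", "c"), ("d", BLANK), (BLANK, "e")]
print(alignment_score(pairs))  # 3 - 1 - 2 - 2 = -2
```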



Jiang et al. presented an O(n^2 (maxdeg)^2)-time algorithm for computing an
optimal alignment between two ordered trees with node labels, where n stands
for the maximum number of nodes in one of the input trees, and maxdeg for the
maximum degree of a node in the input trees. They also provided a polynomial
time algorithm for finding an optimal alignment of two unordered trees in case
maxdeg = O(1), and showed the latter problem to be MAX SNP-hard in general.

Inspired by the known fast method for an optimal alignment between similar 
sequences (see Section 3.3.4 in ^), we give a fast algorithm for optimal align- 
ment between two similar ordered trees with node labels. If there is an optimal 
alignment between the two input ordered trees which uses at most d blank
symbols, then our algorithm runs in O(n log n · (maxdeg)^4 · d^2) time. Hence, under
a natural assumption on the scoring, if the maximum possible score of an
alignment between the two trees is O(d) apart from the score of a perfect alignment
of the first tree with itself, then the algorithm runs in O(n log n · (maxdeg)^4 · d^2)
time. In particular, if both trees are of bounded degree, the running time reduces
to O(n log n · d^2).



¹ In fact, Jiang et al. consider the arithmetically complementary distance measure of
an alignment, which is the subject of minimization.

2 d-Relevant Pairs

The general idea of our algorithm is to modify the dynamic programming algo- 
rithm of Jiang et al. to only consider what we call d-relevant pairs of subtrees 
or subforests. In order to introduce our slightly technical concept of d-relevance 
we need the following definition. 

Definition 1. For an ordered tree T and a node u of T, T[u] denotes the ordered
subtree of T rooted at u. When u is not the root of T, T̄[u] stands for the ordered
subtree of T resulting from removing T[u] and the edge between u and the parent
of u from T. Next, L(T, u) denotes the set of leaves in T that are to the left of the
leaves of T[u]. The number of nodes in T is denoted by |T| and the cardinality
of L(T, u) by |L(T, u)|.

Now, we are ready to introduce the concept of d-relevant pairs of subtrees as 
well as those of d-descendant and d-ancestor. 

Definition 2. Let d be a positive integer. For two ordered trees S, T containing
nodes u and v respectively, the pair of subtrees (S[u], T[v]) is called d-relevant if
||S[u]| − |T[v]|| ≤ d and ||L(S, u)| − |L(T, v)|| ≤ d. For a node w of T, T[w] is
called a d-descendant of T[v] if w is a descendant of v in T and |T[v]| − |T[w]|
≤ d. Symmetrically, T[w] is called a d-ancestor of T[v] if w is an ancestor of v
in T and |T[w]| − |T[v]| ≤ d.
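
The two quantities used in the d-relevance test can be sketched in a few lines. The code below is only an illustration under assumptions: a tree is given as a dictionary mapping each internal node to its ordered list of children (a representation chosen here, not taken from the paper), and a single depth-first traversal computes, for every node v, the subtree size and the number of leaves to the left of the leaves of the subtree rooted at v.

```python
def subtree_stats(children, root):
    """Return {v: (subtree size, number of leaves to the left of v's leaves)}."""
    stats, size, left_of = {}, {}, {}
    leaves_seen = 0
    stack = [(root, False)]
    while stack:
        v, finished = stack.pop()
        kids = children.get(v, [])
        if not finished:
            left_of[v] = leaves_seen        # leaves visited before entering v
            stack.append((v, True))
            if not kids:
                leaves_seen += 1            # v itself is a leaf
            for c in reversed(kids):        # leftmost child is processed first
                stack.append((c, False))
        else:
            size[v] = 1 + sum(size[c] for c in kids)
            stats[v] = (size[v], left_of[v])
    return stats


def is_d_relevant(stat_u, stat_v, d):
    """Both the size difference and the left-leaf difference are at most d."""
    return abs(stat_u[0] - stat_v[0]) <= d and abs(stat_u[1] - stat_v[1]) <= d


# Hypothetical example trees.
stats_s = subtree_stats({"r": ["a", "b"], "a": ["c", "d"]}, "r")
stats_t = subtree_stats({"t": ["x", "y"]}, "t")
print(is_d_relevant(stats_s["a"], stats_t["t"], 0))  # sizes 3 and 3, left counts 0 and 0
```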

The definition of d-relevance immediately yields the following lemma. 

Lemma 1. Let S, T be two labeled ordered trees, and let u, v be two nodes in S
and T respectively. If there is an alignment between S and T which uses at most
d blank symbols (spaces) and consists of an alignment between S[u] and T[v] and
an alignment between S̄[u] and T̄[v], then (S[u], T[v]) is d-relevant for S and T.

The next three lemmas will be useful for bounding the number of d-relevant 
pairs from above. 

Lemma 2. If the pairs (S[u], T[v]) and (S[u], T[w]) are d-relevant for two
ordered trees S and T, and w is a descendant (or, ancestor) of v in T, then T[w]
is a 2d-descendant (or, 2d-ancestor) of T[v].

Proof. Since (S[u], T[v]) is d-relevant, it holds that ||S[u]| − |T[v]|| ≤ d. Suppose
that T[w] is not a 2d-descendant of T[v] in T, i.e., |T[v]| − |T[w]| > 2d. Then
we have |S[u]| − |T[w]| = |S[u]| − |T[v]| + |T[v]| − |T[w]| > −d + 2d = d, which
contradicts the d-relevance of (S[u], T[w]). □

Lemma 3. For a node u of an ordered tree S, the number of d-ancestors of S[u]
is at most d.

Proof. Assume that the number of d-ancestors of S[u] is greater than d. By
the pigeonhole principle there exists a d-ancestor S[u'] whose root u' is located
at distance greater than d from u. But then |S[u']| − |S[u]| > d, which is a
contradiction. □

Lemma 4. Let {(S[u], T[v_i])}_{i=0}^{l} be a sequence of distinct d-relevant pairs in
two ordered trees S, T such that for any 0 ≤ i, j ≤ l, v_i is not a descendant of v_j.
Then, l ≤ 2d holds.

Proof. We may assume w.l.o.g. that the sequence is ordered according to the
left-right order in T. Since (S[u], T[v_l]) is d-relevant, ||L(S, u)| − |L(T, v_l)|| ≤ d
holds. On the other hand, we have |L(T, v_l)| − |L(T, v_0)| ≥ l. Hence, if l > 2d
then |L(S, u)| − |L(T, v_0)| > d, which contradicts the d-relevance of (S[u], T[v_0]). □

By combining the three lemmas above, we obtain an upper bound on the 
number of d-relevant pairs of subtrees. 

Theorem 1. For two ordered trees S, T and a node u of S, the number of
distinct d-relevant pairs of subtrees in which u participates is O(d^2). Consequently,
there are O(|S| · d^2) d-relevant pairs of subtrees for S, T.

Proof. Let {(S[u], T[v_i])}_{i=0}^{l} be a maximal sequence of distinct d-relevant pairs
of subtrees for two ordered trees S, T such that for each 0 ≤ i ≤ l there is no
d-relevant pair (S[u], T[v]), where v is a descendant of v_i. It follows from Lemma 2
that for each d-relevant pair (S[u], T[w]), it either belongs to the sequence or
T[w] is a 2d-ancestor of a member in the sequence. Hence, the number of
d-relevant pairs in which u participates is at most (2d + 1) · (l + 1) by Lemma 3.
Now, it is sufficient to observe that l cannot exceed 2d by Lemma 4. □
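
For readers who want the counting step of the proof written out, the bound can be summarized as follows. This is a restatement under the assumption that l ≤ 2d, as established for sequences of pairwise non-descendant nodes; the set notation on the left is introduced here for illustration only.

```latex
\[
  \#\{\, v : (S[u], T[v]) \text{ is } d\text{-relevant} \,\}
  \;\le\; (2d + 1)(l + 1)
  \;\le\; (2d + 1)^2
  \;=\; O(d^2),
  \qquad
  \sum_{u \in S} O(d^2) \;=\; O(|S| \cdot d^2).
\]
```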



2.1 d-Relevant Pairs of Subforests 

The dynamic programming algorithm of Jiang et al. recursively computes scores 
not only between pairs of subtrees of the input trees but also between some 
pairs of subforests of the trees. Therefore, in order to modify this algorithm, we 
need to generalize the concept of d-relevance for pairs of subtrees to include the 
aforementioned pairs of subforests. For this purpose, we introduce the following 
technical notations. 

Definition 3. For an ordered tree S and a node u of S, let d_u be the degree
of u and denote the children of u by u_1, ..., u_{d_u}, according to their left-to-right
order. S(u, i, j) refers to the ordered forest S[u_i], ..., S[u_j], and S(u) is short for
S(u, 1, d_u).

Thus, S(u) is the complete ordered forest obtained by removing u and all
edges incident to u from S[u]. Also note that S(u, i, i) = S[u_i].

Definition 4. Let S(u, i, j) be an ordered forest in an ordered tree S. S̄(u, i, j)
stands for the ordered subtree of S obtained by removing S(u, i, j) and all edges
incident to S(u, i, j) from S. L(S(u, i, j)) denotes the set of leaves in S that are
to the left of the leaves of S(u, i, j). The number of nodes in S(u, i, j) is denoted
by |S(u, i, j)| and the cardinality of L(S(u, i, j)) by |L(S(u, i, j))|.

Now, we are ready to generalize the concept of d-relevance as well as those of
d-descendant and d-ancestor for pairs of nodes inducing full subtrees to include
pairs of subforests of the form (S(u, i, j), T(v, k, l)).

Definition 5. Let d be a positive integer. For two ordered trees S, T containing
nodes u and v respectively, the pair of ordered subforests (S(u, i, j), T(v, k, l)) is
called d-relevant if ||S(u, i, j)| − |T(v, k, l)|| ≤ d and ||L(S(u, i, j))| − |L(T(v, k, l))||
≤ d. For a node w of T, T(w, k', l') is called a d-descendant of T(v, k, l) if w
is a descendant of v in T and ||T(w, k', l')| − |T(v, k, l)|| ≤ d. Symmetrically,
T(w, k', l') is called a d-ancestor of T(v, k, l) if w is an ancestor of v in T and
||T(w, k', l')| − |T(v, k, l)|| ≤ d.

The definition of d-relevance of subforests immediately yields the following
lemma analogous to Lemma 1.

Lemma 5. Let S, T be two labeled ordered trees, and let S(u, i, j) and T(v, k, l)
be ordered forests in S and T respectively. If there is an alignment between S
and T which uses at most d blank symbols (spaces) and consists of an alignment
between S(u, i, j) and T(v, k, l) and an alignment between S̄(u, i, j) and T̄(v, k, l),
then (S(u, i, j), T(v, k, l)) is d-relevant for S and T.

The next three lemmata will be useful for bounding the number of d-relevant
pairs of subforests from above. Their proofs are analogous to the corresponding
proofs of Lemmata 2, 3, and 4.

Lemma 6. If the pairs (S(u, i, j), T(v)) and (S(u, i, j), T(w)) are d-relevant for
two ordered trees S, T and w is a descendant (or, ancestor) of v in T, then T(w)
is a 2d-descendant (or, 2d-ancestor) of T(v).



Lemma 7. For a node v of an ordered tree T, the number of d-ancestors of the
form T(w) of a forest T(v) is at most d.

Lemma 8. Let {(S(u, i, j), T(v_q))}_{q=0}^{l} be a sequence of distinct d-relevant pairs
in two ordered trees S, T such that for any 0 ≤ q', q'' ≤ l, v_{q'} is not a descendant
of v_{q''}. Then, l ≤ 2d holds.

By combining the three lemmas above, we obtain an upper bound on the
number of d-relevant pairs (S(u), T(v, k, l)) and (S(u, i, j), T(v)) as in
Theorem 1.

Theorem 2. For two ordered trees S, T and a node u of S, the number of
distinct d-relevant pairs of the form (S(u, i, j), T(v)) is O(d^2 (deg(S))^2).
Symmetrically, for a node v of T, the number of distinct d-relevant pairs of the form
(S(u), T(v, k, l)) is O(d^2 (deg(T))^2). Consequently, there are O(n · d^2 (maxdeg)^2)
d-relevant pairs of the form (S(u), T(v, k, l)) or (S(u, i, j), T(v)) for S, T, where
n = max{|S|, |T|}.

3 Constructing the d-Relevant Pairs

The test for d-relevance for a pair of subtrees can easily be accomplished in
constant time after appropriate preprocessing. However, in order to speed up
the at least quadratic algorithm of Jiang et al., we cannot afford testing each
possible subtree pair for d-relevance. Instead, we proceed as follows.

First, we compute all vectors (|T[v]|, |L(T, v)|), where v ∈ T. This can be
done in linear total time by using the Eulerian tour technique. We
insert the vectors into a standard range search data structure, e.g., a layered
range tree. The construction of the data structure takes O(|T| · log |T|) time.
Then, for all u in S we compute the vectors (|S[u]|, |L(S, u)|) in linear total time
in the same way as above. For each u in S, we query the data structure with
the square centered at (|S[u]|, |L(S, u)|) having side length 2d. Each query takes
O(log |T| + r) time, where r is the number of reported vectors. Since each of the
returned vectors is in one-to-one correspondence with a node v such that the
pair (u, v) is d-relevant, r = O(d^2) holds by Theorem 1.
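
The reporting step can be sketched as follows. Note the substitution: instead of a layered range tree, this illustration sorts the vectors of T by subtree size and uses binary search on that coordinate, then filters on the second coordinate. This is simpler to write down but does not give the O(log n + r) query bound stated above. The input format (a dictionary from node to the pair of subtree size and left-leaf count) and all names are assumptions.

```python
from bisect import bisect_left, bisect_right


def d_relevant_pairs(stats_s, stats_t, d):
    """Report all (u, v) whose vectors differ by at most d in both coordinates.

    stats_s, stats_t: node -> (subtree size, number of leaves to the left).
    """
    by_size = sorted(
        ((size, left, v) for v, (size, left) in stats_t.items()),
        key=lambda entry: entry[0],
    )
    sizes = [entry[0] for entry in by_size]
    pairs = []
    for u, (size_u, left_u) in stats_s.items():
        lo = bisect_left(sizes, size_u - d)     # first vector with size >= size_u - d
        hi = bisect_right(sizes, size_u + d)    # one past the last with size <= size_u + d
        for _, left_v, v in by_size[lo:hi]:
            if abs(left_u - left_v) <= d:       # second side of the query square
                pairs.append((u, v))
    return pairs


# Hypothetical vectors for two small trees.
vec_s = {"r": (5, 0), "a": (3, 0), "b": (1, 2), "c": (1, 0), "d": (1, 1)}
vec_t = {"t": (3, 0), "x": (1, 0), "y": (1, 1)}
print(sorted(d_relevant_pairs(vec_s, vec_t, 0)))
```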

Putting everything together, we obtain the following theorem. 

Theorem 3. For two ordered trees on at most n nodes each and a non-negative
integer d, all d-relevant pairs of subtrees can be reported in O(n(log n + d^2)) time.

We can use the same technique to precompute all pairs of d-relevant
subforests. In fact, for our purposes it is sufficient to report all pairs of d-relevant
subforests where at least one of the subforests is complete, i.e., is of the form
S(u) or T(v). To report all d-relevant pairs of the form (S(u), T(v, k, l)), the
number of vectors to insert into the layered range tree is O(|T| (maxdeg)^2)
since O((maxdeg)^2) ordered forests of the form T(v, k, l) originate from each
node v in T. Thus, the construction time becomes O(|T| · (maxdeg)^2 · log(|T| ·
(maxdeg)^2)) = O(n · (maxdeg)^2 · log n). The number of queries to the data
structure is O(|S|), and the query time is O(log(|T| · (maxdeg)^2) + r) = O(log n + r),
where the sum of the r's over S is O(n d^2 · (maxdeg)^2) by Theorem 2. The
reporting of d-relevant pairs of the form (S(u, i, j), T(v)) can be done
symmetrically within the same (in terms of n) preprocessing and query time bounds.

Summing up, we obtain: 

Theorem 4. For two ordered trees on at most n nodes each and a non-negative
integer d, all d-relevant pairs of subforests, where at least one subforest is
complete, can be reported in O(n · (maxdeg)^2 · (log n + d^2)) time.

4 The Fast Algorithm 

Our fast algorithm for an optimal alignment between two ordered trees works
under the assumption that there is an optimal alignment between the trees
S, T which uses at most d blank symbols (spaces). First, we compute all
d-relevant pairs of subtrees of S, T as described in Section 3. As each d-relevant
pair is reported, we insert it into a balanced binary search tree B_1. Next, all
d-relevant pairs of subforests in which at least one subforest is complete are
computed and inserted into a balanced binary search tree B_2. According to
Theorems 1 and 2 there are O(n d^2 (maxdeg)^2) d-relevant pairs of subtrees or
subforests where at least one subforest is complete, so this preprocessing takes
O(n · (maxdeg)^2 · (log n + d^2) + n · d^2 (maxdeg)^2 · log(n · d^2 (maxdeg)^2)) =
O(n log n · (maxdeg)^2 · d^2) time by Theorems 3 and 4. Then, we modify the algorithm of
Jiang et al., recursively evaluating the score values solely for d-relevant
pairs of subtrees or pairs of subtrees where one of the subtrees is empty.

This evaluation also involves recursive evaluation of the score values for
d-relevant pairs of subforests where one of the forests is complete or empty. In
fact, the recursive procedures of Jiang et al. also include intermediate terms with the
score values for pairs of subforests where none of the forests is complete or
empty. However, these intermediate terms are eliminated by the composition
of the aforementioned procedures, resulting in recursive formulas for the score
values expressed in the form of a maximum of some sums of score values for pairs
of smaller subtrees or subforests where at least one of the subforests is
complete or empty. Whenever the left-hand side is d-relevant in the application of
such a formula, the components of the sum on the right-hand side yielding the
maximum, with the exception of the scores for the pairs including an empty
subtree or subforest, also have to be d-relevant. Therefore, before an application
of such a formula to an evaluation of a d-relevant pair, we simply test each of
the components of the sums on the right-hand side, which is not a score for a pair
containing an empty subtree or subforest, for membership in B_1 or B_2. Such a
membership query takes O(log n) time. If the test is positive, we fetch the score
value for the argument pair, which should have been evaluated by this time; otherwise
we set that score value to minus infinity. The score values for pairs containing
an empty subtree or subforest can be trivially precomputed in time O(|S| + |T|).
We conclude that the cost of determining the score for a d-relevant pair on the
left-hand side of such a recursive formula on the basis of the scores for d-relevant
pairs occurring on its right-hand side does not exceed the cost of determining
the score for this pair on the basis of the scores of pairs occurring on the right-hand
side in the algorithm of Jiang et al. multiplied by O(log n).
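
The membership test with a fallback of minus infinity can be illustrated with a short sketch. Two assumptions are made here: a Python dictionary stands in for the balanced binary search trees (giving expected constant-time rather than O(log n) lookups), and the recursive formula is abstracted as a list of candidate sums, each a list of pairs whose scores are added. Pairs containing an empty subtree or subforest, which the text handles by precomputation, are not modelled.

```python
import math


def lookup_score(scores, pair):
    """Stored score of a d-relevant pair, or minus infinity if the pair is absent."""
    return scores.get(pair, -math.inf)


def best_of(scores, candidate_sums):
    """Maximum over candidate sums; a sum with a non-relevant pair becomes -inf."""
    return max(
        sum(lookup_score(scores, pair) for pair in pairs)
        for pairs in candidate_sums
    )


# Hypothetical scores of already evaluated d-relevant pairs.
scores = {("a", "x"): 3, ("b", "y"): -1}
candidates = [
    [("a", "x"), ("b", "y")],   # both pairs are stored: 3 + (-1) = 2
    [("a", "y"), ("b", "x")],   # neither pair is stored: -inf
]
print(best_of(scores, candidates))  # 2
```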

Jiang et al. show that the cost of determining the score for a pair of subtrees
or subforests by using the aforementioned formulas and already computed scores
for pairs of smaller subtrees or subforests is O(deg(z) · (maxdeg)^2), where z is
a node in S or T which is either the root of the first subtree or the second
subtree, or the parent of the roots of the trees in the first forest or the second
forest. Hence, the corresponding cost for d-relevant pairs in our modification
of this algorithm is O(deg(z) · (maxdeg)^2 · log n). By Theorems 1 and 2, for a
given node z in S or T, there are O(d^2 (maxdeg)^2) d-relevant pairs of subtrees
of the form (S[z], T[v]) or (S[u], T[z]), or subforests of the form (S(z, i, j), T(v))
or (S(u), T(z, k, l)). Hence, our modified algorithm runs in O(Σ_{z ∈ S ∪ T} deg(z) ·
(maxdeg)^2 · log n · d^2 (maxdeg)^2) time, i.e., O(n d^2 (maxdeg)^4 log n), assuming the
preprocessing has been done.
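
The summation can be written out as a worked step. It assumes a per-pair cost of O(deg(z) · (maxdeg)^2 · log n), at most O(d^2 (maxdeg)^2) relevant pairs per node z, and the fact that the degrees in a tree with at most n nodes sum to O(n).

```latex
\[
  \sum_{z \in S \cup T}
    \deg(z) \cdot (\mathit{maxdeg})^2 \cdot \log n \cdot d^2 (\mathit{maxdeg})^2
  \;=\;
  d^2 (\mathit{maxdeg})^4 \log n \sum_{z \in S \cup T} \deg(z)
  \;=\;
  O\bigl(n \, d^2 (\mathit{maxdeg})^4 \log n\bigr).
\]
```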

Theorem 5. An optimal alignment of two ordered trees which uses at most d
blank symbols can be constructed in O(n log n · (maxdeg)^4 · d^2) time.

Under the natural assumption that the score of a pair including at least one
blank symbol is negative and by Ω(1) smaller than that of a pair consisting of
two identical symbols, we immediately obtain the following corollaries.

Corollary 1. An optimal alignment of two ordered trees whose score is O(d)
apart from the score of the perfect alignment between the first tree and its copy
can be constructed in O(n log n · (maxdeg)^4 · d^2) time.

Corollary 2. An optimal alignment of two ordered trees of bounded degree whose
score is O(d) apart from the score of the perfect alignment between the first tree
and its copy can be constructed in O(n log n · d^2) time.

5 Final Remarks 

An optimal alignment between two sequences whose score is at most d apart
from that of a perfect alignment between the first sequence and its copy can
be constructed in O(nd) time. Since a sequence can be interpreted as a line
ordered tree with node labels, a natural question arises: is it possible to lower
the time complexity of our method, especially the exponent 2 of d?

Our method does not seem to generalize to include unordered trees directly.
Simply, the proof of Lemma 4 relies on the ordering of the trees (i.e., on the
sets L( , )). It is an interesting open problem whether a substantial speed-up
in the construction of an optimal alignment between similar unordered trees of
bounded degree is achievable.

In the construction of the d-relevant pairs we could use more sophisticated
and more asymptotically efficient data structures for two-dimensional range
search on an integer grid. However, this would not lead to an improvement
of the overall asymptotic time complexity of our alignment algorithm.

Abstract. We study the parameterized complexity of the problem of
reconstructing a binary (evolutionary) tree from a complete set of quartet
topologies in the case of a limited number of errors. More precisely, we
are given n taxa, exactly one topology for every subset of 4 taxa, and
a positive integer k (the parameter). Then, the Minimum Quartet
Inconsistency (MQI) problem is the question of whether we can find an
evolutionary tree inducing a set of quartet topologies that differs from the
given set in only k quartet topologies. MQI is NP-complete. However,
we can compute the required tree in worst case time O(4^k · n + n^4);
that is, the problem is fixed parameter tractable. Our experimental results show
that in practice, also based on heuristic improvements proposed by us,
even a much smaller exponential growth can be achieved. We extend the
fixed parameter tractability result to weighted versions of the problem.
In particular, our algorithm can produce all solutions that resolve at
most k errors.



1 Introduction 

In recent years, quartet methods for reconstructing evolutionary trees have
received considerable attention in the computational biology community. In
comparison with other phylogenetic methods, an advantage of quartet methods
is, e.g., that they can overcome the data disparity problem.
The approach is based on the fact that an evolutionary tree is uniquely
characterized by its set of induced quartet topologies. Herein, we consider an
evolutionary tree to be an unrooted binary tree T in which the leaves are
bijectively labeled by a set of taxa S. A quartet, then, is a size four subset {a, b, c, d}
of S, and the topology for {a, b, c, d} induced by T simply is the four leaf subtree
of T induced by {a, b, c, d}. The three possible quartet topologies for {a, b, c, d}
are [ab|cd], [ac|bd], and [ad|bc].¹ E.g., the topology is [ab|cd] when, in T, the
paths from a to b and from c to d are disjoint. The fundamental goal of quartet

* Work supported by the DFG projects “KOMET,” LA 618/3-3, and “OPAL” (opti- 
mal solutions for hard problems in computational biology), NI-369/2-1. 

¹ The fourth possible topology would be the star topology, which is not considered
here because it is not binary.

methods is, given a set of quartet topologies, to reconstruct the corresponding
evolutionary tree. The computational interest in this paradigm derives from the 
fact that the given set of quartet topologies usually is fault-prone. 
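
The notion of an induced quartet topology can be made concrete with a short sketch. It uses the characterization given above: the topology is [ab|cd] when the paths from a to b and from c to d are disjoint. The adjacency-dictionary representation of the unrooted tree, the function names, and the example tree are assumptions made for this illustration.

```python
from collections import deque


def build_adjacency(edges):
    """Undirected adjacency sets from a list of edges."""
    adj = {}
    for v, w in edges:
        adj.setdefault(v, set()).add(w)
        adj.setdefault(w, set()).add(v)
    return adj


def path_nodes(adj, s, t):
    """Set of nodes on the unique tree path from s to t (breadth-first search)."""
    parent = {s: None}
    queue = deque([s])
    while queue:
        v = queue.popleft()
        if v == t:
            break
        for w in adj[v]:
            if w not in parent:
                parent[w] = v
                queue.append(w)
    nodes, v = set(), t
    while v is not None:
        nodes.add(v)
        v = parent[v]
    return nodes


def induced_topology(adj, a, b, c, d):
    """Return the induced topology as a string such as '[ab|cd]'."""
    for (p, q), (r, s) in (((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))):
        if path_nodes(adj, p, q).isdisjoint(path_nodes(adj, r, s)):
            return f"[{p}{q}|{r}{s}]"
    return None  # can only happen if the tree is not binary (star topology)


# Hypothetical tree with leaves a..e and internal nodes x, y, z.
adj = build_adjacency([("a", "x"), ("b", "x"), ("x", "y"), ("c", "y"),
                       ("y", "z"), ("d", "z"), ("e", "z")])
print(induced_topology(adj, "a", "b", "c", "d"))  # [ab|cd]
```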

In this paper, we focus on the following, perhaps most often studied opti- 
mization problem in the context of quartet methods. 



Minimum Quartet Inconsistency (MQI) 

Input: A set S of n taxa and a set Q_S of quartet topologies such that
there is exactly one topology for every quartet set² corresponding to S
and a positive integer k.

Question: Is there an evolutionary tree T where the leaves are bijectively
labeled by the elements from S such that the set of quartet topologies
induced by T differs from Q_S in at most k quartet topologies?



MQI is NP-complete. Concerning the approximability of MQI, it is known
that it is polynomial time approximable with a factor n^2. It is an open
question whether MQI can be approximated with a factor at most n or
even with a constant factor. The parameterized complexity of MQI, however,
so far, has apparently been neglected; we close this gap here. Assuming that
the number k of “wrong” quartet topologies is small in comparison with the
total number of given quartet topologies, we show that MQI is fixed parameter
tractable; that is, MQI can be solved exactly in worst case time O(4^k n + n^4).
Observe that the input size is O(n^4). It is worth noting here that the variant of
MQI where the set Q_S is not required to contain a topology for every quartet is
NP-complete, even if k = 0. Hence, this excludes parameterized complexity
studies and also implies inapproximability (with any factor).

To develop our algorithm, we exhibit some nice combinatorial properties 
of MQI. For instance, we point out that “global conflicts” due to erroneous 
quartet topologies can be reduced to “local conflicts.” The basis for this was laid 
by Bandelt and Dress. This is the basic observation in order to show fixed
parameter tractability of MQI. Our approach makes it possible to construct 
all evolutionary trees that can be (uniquely) obtained from the given input by 
changing at most k quartet topologies. This puts the user of the algorithm in 
the position to pick (e.g., based on additional biological knowledge) the probably 
best, most reasonable solution or to construct a consensus tree from all solutions. 
Moreover, our method also generalizes to weighted quartets. 

We performed several experiments on artificial and real (fungi) data and, 
thereby, showed that our algorithm (due to several tuning tricks) in practice runs 
much faster than its theoretical (worst case) analysis predicts. For instance, with 
a small k (e.g., k = 100), we can solve relatively large (n = 50 taxa) instances 
optimally in around 40 minutes on a LINUX PC with a Pentium III 750 MHz 
processor and 192 MB main memory. 

A full version (containing all proofs) is available.



² Note that given n species, there are (n choose 4) = O(n^4) corresponding quartet topologies.





Minimum Quartet Inconsistency Is Fixed Parameter Tractable 243 



2 Preliminaries 

Minimum Quartet Inconsistency. In order to find the “best” binary tree for
a given set of quartet topologies, we can ask for a tree that violates a minimum
number of topologies. In case we are given exactly one quartet topology for
every set of four taxa, this question gives the MQI problem. If there is not
necessarily a quartet topology for every set of four taxa, Ben-Dor et al.
propose two solutions, namely, a heuristic approach and an exact algorithm. The
heuristic solution is based on semidefinite programming and does not guarantee
to produce the optimal solution, but has a polynomial running time. The exact
algorithm uses dynamic programming for finding the optimal solution and has
exponential running time, namely, O(m 3^n), where n is the number of species
and m is the number of given quartet topologies. Note that Ben-Dor et al.
run all their experiments on MQI instances, i.e., there was exactly one quartet
topology for every set of four taxa. In that case, we have m = O(n^4). The memory
requirement of their exact solution is O(2^n). According to Jiang et al., there
is a factor n^2 approximation, and, at the same time, they asked about better
approximation results. Note that the complement problem of MQI, where one
tries to maximize |Q_T ∩ Q| (Q_T being the set of quartet topologies induced by
a tree T), possesses a polynomial time approximation scheme.



Some Notation. Assume that we are given a set of n taxa S. For a quartet
{a, b, c, d} ⊆ S, we refer to its possible quartet topologies by [ab|cd], [ac|bd], and
[ad|bc]. These are the only possible topologies up to isomorphism. A set of quartet
topologies is complete if it contains exactly one topology for every quartet of S.
A complete set of quartet topologies over S we denote by Q_S. A set of quartet
topologies Q is tree-consistent if there exists a tree T such that for the set Q_T
of quartet topologies induced by T, we have Q ⊆ Q_T. Set Q is tree-like if there
exists a tree T with Q = Q_T. Since an evolutionary tree is uniquely characterized
by the topologies for all its quartets, a complete set of topologies is
tree-consistent iff it is tree-like. A set of topologies has a “conflict” whenever it is not
tree-consistent. We will call a conflict “global” when a complete set of topologies
is not tree-consistent. We call it “local” when a size three set of topologies, which
necessarily is incomplete, is not tree-consistent.



3 Global Conflicts Are Local 

Given a complete set of quartet topologies which is not tree-consistent, the results
of Bandelt and Dress imply that there already is a subset of only three quartet
topologies which is not tree-consistent. This is the key to developing a fixed
parameter solution for the problem: It is sufficient to examine the size three
sets of quartet topologies and to recursively branch on those sets which are not
tree-consistent, as will be explained in a later section.

Proposition 1. (Proposition 2 in Bandelt and Dress) Given a set of taxa S and a complete set
of quartet topologies Q_S over these taxa, Q_S is tree-like iff the following so-called
substitution property holds for every five distinct taxa a, b, c, d, e ∈ S:

[ab|cd] ∈ Q_S implies [ab|ce] ∈ Q_S or [ae|cd] ∈ Q_S.
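
The substitution property lends itself to a direct brute-force check, sketched below. The encoding is an assumption made for illustration: a topology [pq|rs] is a frozenset of two frozensets, and a complete set of topologies is a dictionary keyed by the quartet. The check tries every ordered choice of five distinct taxa, so it is meant for small examples only.

```python
from itertools import permutations


def topo(p, q, r, s):
    """Unoriented quartet topology [pq|rs]."""
    return frozenset((frozenset((p, q)), frozenset((r, s))))


def complete_set(*topologies):
    """Map each quartet (frozenset of four taxa) to its topology."""
    return {frozenset(x for side in t for x in side): t for t in topologies}


def satisfies_substitution(taxa, topologies):
    """Check: [ab|cd] in Q_S implies [ab|ce] in Q_S or [ae|cd] in Q_S."""
    def holds(p, q, r, s):
        return topologies[frozenset((p, q, r, s))] == topo(p, q, r, s)

    for a, b, c, d, e in permutations(taxa, 5):
        if holds(a, b, c, d) and not (holds(a, b, c, e) or holds(a, e, c, d)):
            return False
    return True


# Hypothetical complete set induced by a tree with cherries {a, b} and {d, e}.
tree_like = complete_set(
    topo("a", "b", "c", "d"), topo("a", "b", "c", "e"), topo("a", "b", "d", "e"),
    topo("a", "c", "d", "e"), topo("b", "c", "d", "e"),
)
# The same set with one topology replaced by [ac|bd].
broken = dict(tree_like)
broken[frozenset("abcd")] = topo("a", "c", "b", "d")

print(satisfies_substitution("abcde", tree_like))  # True
print(satisfies_substitution("abcde", broken))     # False
```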

In the following, we show that in Proposition 1 we can replace the
substitution property introduced by Bandelt and Dress with the more common term
of tree-consistency. This is because, for an incomplete set of only three
topologies, the substitution property is tightly connected to the tree-consistency of the
topologies. We will state this in the following technical Lemmas 1 and 2 (proofs
omitted; see the full version) and later use it to give, in Theorem 1, another interpretation
of Proposition 1.

Lemma 1. Three topologies involving more than five taxa are tree-consistent.

When searching for local conflicts, Lemma 1 makes it possible to focus on the
case of three topologies involving only five taxa. If the substitution property, as
given in Proposition 1, is not satisfied, we say that the topologies for the quartets
{a, b, c, d}, {a, b, c, e}, and {a, c, d, e} contradict the substitution property.

Lemma 2. For a given set of taxa S, three topologies consisting of taxa from S
are tree-consistent iff they do not contradict the substitution property.

Note that Lemma 2, involving a necessarily incomplete set of three topologies, does
not generalize from size three to an incomplete set of arbitrary size, as exhibited
in the following example. For taxa {a, b, c, d, e, f}, consider the incomplete set
of topologies [ab|cd], [ab|ce], [bc|de], [cd|ef], and [af|de]. Without going into the
details, we only state here that these topologies are not tree-consistent, although
there are no three topologies which contradict the substitution property.

Theorem 1 now will make it clearer that “global” tree-consistency of a
complete set of topologies is reflected in “local” tree-consistency of every three topologies
taken from this set.

Theorem 1. Given a set of taxa S and a complete set of quartet topologies Q_S
over S, Q_S is tree-like (and, thus, tree-consistent) iff every set of three topologies
from Q_S is tree-consistent.

Proof. Due to Lemmas 1 and 2, we may replace the substitution property in
Proposition 1 with tree-consistency. This gives the result. □

When we have a complete set of topologies Q_S for a set of taxa S, we do not
necessarily know whether the set is tree-like or not. If it is not, we can, according
to Theorem 1, track down a subset of three topologies that is not tree-consistent.
Our goal will be to detect all these local conflicts. This will be the preprocessing
stage of the algorithm that will be described later, in order to (try to)
“repair” the conflicts in a succeeding stage of the algorithm. We can find all these
local conflicts in time O(n^5) as follows. Since, following Lemma 1, only three
topologies involving five taxa can form a local conflict, it suffices to consider all
size five sets of taxa {a, b, c, d, e} ⊆ S. There are five quartets over this size five
set of taxa, namely, {a, b, c, d}, {a, b, c, e}, {a, b, d, e}, {a, c, d, e}, and {b, c, d, e}.
For the topologies of these quartets, we can test, in constant time, whether there
are three among them that are not tree-consistent. Doing so for every size five
set, we will, if Q_S is not tree-consistent, certainly obtain a size three subset of
Q_S which is not tree-consistent. Moreover, from Lemma 1 we know that we find
all these local conflicts in time O(n^5).

We can improve this time bound for the preprocessing stage of the algorithm
to be described in Section 5 with the following result by Bandelt and Dress.
They show that it is sufficient to restrict our attention to the size five
sets containing some arbitrarily fixed taxon f.

Proposition 2. (Proposition 6 in Bandelt and Dress.) Given a set of taxa S, a
complete set of quartet topologies Qs, and some taxon f ∈ S, Qs is tree-like
iff every size five set of taxa which contains f satisfies the substitution
property.

Following Proposition 2, we can select some arbitrary f ∈ S and examine only
the size five sets involving f. Similar to our procedure described above, we
consider every such size five set containing f separately. Among the
topologies over this size five set, we search the size three sets which are
not tree-consistent. If the set of quartet topologies Qs is not
tree-consistent, we will find a size three set of quartet topologies which is
not tree-consistent. Finding these local conflicts which involve f can be done
in time O(n^4).
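The difference between the two enumeration schemes can be made concrete with a
minimal Python sketch. The helper names quartets_of and five_sets are
illustrative assumptions, not names from the paper, and the sketch only
enumerates the candidate sets; it does not test tree-consistency.

```python
from itertools import combinations

def quartets_of(five_set):
    """Return the five quartets (as sorted tuples) over a size five set of taxa."""
    return [tuple(q) for q in combinations(sorted(five_set), 4)]

def five_sets(taxa, fixed=None):
    """Enumerate size five sets of taxa; if `fixed` is given, only those containing it."""
    taxa = sorted(taxa)
    if fixed is None:
        return [frozenset(s) for s in combinations(taxa, 5)]
    others = [t for t in taxa if t != fixed]
    return [frozenset(s) | {fixed} for s in combinations(others, 4)]

taxa = range(8)
all_sets = five_sets(taxa)           # grows like n^5
f_sets = five_sets(taxa, fixed=0)    # grows like n^4: only sets containing taxon 0
```

For eight taxa, the unrestricted enumeration visits every size five subset,
while fixing one taxon leaves only the four remaining taxa of each set to
choose.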

4 Combinatorial Characterization of Local Conflicts 

Given three topologies, we need to decide whether they are tree-consistent or 
not. Directly using the definition of tree-consistency turns out to be a rather 
technical, troublesome task, since we have to reason whether or not a tree topol- 
ogy exists that induces the topologies. Similarly, it can be difficult to test, for the 
topologies, whether or not they contradict the substitution property. To make 
things less technical and easier to grasp, we subsequently give a useful combi- 
natorial characterization of local conflicts, i.e., three topologies which are not 
tree-consistent. Note that in the following definition, we distinguish two
possible orientations of a quartet topology [ab|cd], namely, [ab|cd], with
a, b on its left hand side and c, d on its right hand side, and [cd|ab], with
the sides interchanged.

Definition 1. Given a set of topologies where each of the topologies is
assigned an orientation, let l be the number of different taxa occurring in
the left hand sides of the topologies and let r be the number of different
taxa occurring in the right hand sides of the topologies. The signature, then,
is the pair (l, r) that, over all possible orientations for these topologies,
minimizes l.

Theorem 2. Three quartet topologies are not tree-consistent iff they involve
five taxa and their signature is (3,4) or (4,4).

Fig. 1. Possible trees for [ab|cd] and taxon e in the proof of Theorem 2.

Proof. (⇒) We show that, given three topologies t1, t2, and t3 which are not
tree-consistent, they involve five taxa and have signature (3,4) or (4,4).
From an earlier lemma we know that three topologies are not tree-consistent
iff they contradict the substitution property. To recall, three topologies
contradict the substitution property if, for one of these topologies,
w.l.o.g., t1 = [ab|cd], neither the topology t2 for quartet {a, b, c, e} is
[ab|ce] nor the topology t3 for quartet {a, c, d, e} is [ae|cd]. Therefore,
the topology t2 is either [ac|be] or [ae|bc], and the topology t3 is either
[ac|de] or [ad|ce]. By exhaustively checking the possible combinations, we can
find that the topologies involve five taxa and their signature is (3,4) (e.g.,
for t2 = [ac|be] and t3 = [ac|de]) or (4,4) (e.g., for t2 = [ac|be] and
t3 = [ad|ce]).

(⇐) We are given three topologies, t1, t2, and t3, involving five taxa and
having signature (3,4) or (4,4). Assume that they are tree-consistent. Showing
that this implies signature (2,3) or (3,3), we prove that the assumption is
wrong. For tree-consistent t1, t2, and t3, we can find a tree inducing them.
With, w.l.o.g., taxa {a, b, c, d, e} and t1 = [ab|cd], we mainly have two
possibilities: we can attach the leaf e on the middle edge of topology t1, as
shown in Figure 1(a), or we can attach e on one of the four side branches of
t1, as shown by example in Figure 1(b). Considering the sets of quartet
topologies induced by these trees, we find, in each case, that the set has
signature (3,3). For instance, the topologies induced by the tree in
Figure 1(a) are, besides t1, [ab|ce], [ab|de], [ae|cd], and [be|cd]. Three
topologies selected from these have signature (3,3) (e.g., [ab|cd], [ab|ce],
and [ae|cd]) or (2,3) (e.g., [ab|cd], [ab|ce], and [ab|de]). □

Using Theorem 2, we can determine whether three topologies are conflicting by
simply counting the involved taxa and computing their signature.
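The counting test can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: a topology [ab|cd] is represented as an
unordered pair of unordered sides, the function names are made up for this
sketch, and ties between orientations with the same l are broken by the
smaller r, which Definition 1 does not spell out.

```python
from itertools import product

def topo(a, b, c, d):
    """Quartet topology [ab|cd] as an unordered pair of unordered sides."""
    return frozenset({frozenset({a, b}), frozenset({c, d})})

def signature(topologies):
    """Signature (l, r): try every orientation, keep the fewest left-hand taxa.

    Assumption: ties on l are broken by the fewest right-hand taxa."""
    best = None
    for orientation in product(*[((x, y), (y, x)) for x, y in map(tuple, topologies)]):
        left = set().union(*(lhs for lhs, _ in orientation))
        right = set().union(*(rhs for _, rhs in orientation))
        candidate = (len(left), len(right))
        if best is None or candidate < best:
            best = candidate
    return best

def is_local_conflict(t1, t2, t3):
    """Theorem 2: a conflict iff five taxa are involved and the signature is (3,4) or (4,4)."""
    taxa = set().union(*(side for t in (t1, t2, t3) for side in t))
    return len(taxa) == 5 and signature([t1, t2, t3]) in {(3, 4), (4, 4)}

# [ab|cd], [ab|ce], [ae|cd] are induced by one tree, so they are not a conflict.
consistent = is_local_conflict(topo(*"abcd"), topo(*"abce"), topo(*"aecd"))
# [ab|cd], [ac|be], [ac|de] contradict the substitution property.
conflicting = is_local_conflict(topo(*"abcd"), topo(*"acbe"), topo(*"acde"))
```

Since three topologies have only eight orientations in total, the test takes
constant time per triple, as required by the preprocessing stage.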



5 Fixed Parameter Algorithm for MQI 

In this section, we present a recursive algorithm solving MQI with parameter
k. Before calling the recursive part for the first time, one has to build the
list of size three sets of quartets whose topologies are not tree-consistent.
The preparation of this conflict list is explained in Section 3. After that,
we call the recursive procedure of the algorithm with argument k.

The recursive procedure selects a local conflict to branch on from the
conflict list. This branching is done by changing one topology from the
selected local conflict, updating the conflict list, and calling the recursive
procedure with argument k − 1 on the thereby created subcases. We will later
explain how to select and change the topologies when branching. After a
topology t has changed, the algorithm updates the conflict list as follows: It
(1) removes the size three sets of quartets in the list whose topologies are
now tree-consistent, and (2) adds the size three sets of quartets not in the
list whose topologies now form a local conflict.

The recursion stops if no conflicts are left in the conflict list (we have
found a solution), or if k = 0 (in case the conflict list is not empty, we did
not find a solution in this branch of the search tree). When a solution is
found, the algorithm outputs the current set of topologies, i.e., a complete
set of quartet topologies that is tree-like and that can be obtained by
altering at most k topologies in the given set of topologies. From this
tree-like set of quartet topologies, it is possible to derive the evolutionary
tree in time O(n^4). Thus, scanning the whole search tree, we find all
solutions that we can obtain by altering at most k topologies.
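The recursive scheme can be illustrated with a compact Python sketch. It is
not the authors' implementation (which, according to Section 8, was written in
C): for brevity it recomputes a violated instance of the substitution property
in every call instead of maintaining a conflict list, it stops at the first
solution instead of reporting all of them, and all function names are
assumptions made for this example. It relies on the characterization cited
above that a complete set of topologies is tree-like iff it satisfies the
substitution property.

```python
from itertools import combinations, permutations

def topo(a, b, c, d):
    """Quartet topology [ab|cd] as an unordered pair of unordered sides."""
    return frozenset({frozenset({a, b}), frozenset({c, d})})

def alternatives(quartet):
    """The three possible topologies of a quartet."""
    a, b, c, d = sorted(quartet)
    return [topo(a, b, c, d), topo(a, c, b, d), topo(a, d, b, c)]

def find_violation(Q, taxa):
    """Return a violated instance of the substitution property, or None.

    The property asks that [ab|cd] implies [ab|ce] or [ae|cd] for every other
    taxon e; the result names the quartets involved and the required topologies."""
    for q1, t1 in Q.items():
        for side1, side2 in permutations(tuple(t1)):
            for a, b in permutations(side1):
                for c, d in permutations(side2):
                    for e in taxa - q1:
                        q2, want2 = frozenset({a, b, c, e}), topo(a, b, c, e)
                        q3, want3 = frozenset({a, c, d, e}), topo(a, e, c, d)
                        if Q[q2] != want2 and Q[q3] != want3:
                            return q1, (q2, want2), (q3, want3)
    return None

def solve(Q, taxa, k):
    """Return a conflict-free copy of Q reachable with at most k changes, or None."""
    violation = find_violation(Q, taxa)
    if violation is None:
        return dict(Q)
    if k == 0:
        return None
    q1, fix2, fix3 = violation
    # Four subcases: the two alternatives for t1, or repair t2, or repair t3.
    subcases = [(q1, t) for t in alternatives(q1) if t != Q[q1]] + [fix2, fix3]
    for quartet, new_topology in subcases:
        old_topology = Q[quartet]
        Q[quartet] = new_topology
        result = solve(Q, taxa, k - 1)
        Q[quartet] = old_topology          # undo before trying the next subcase
        if result is not None:
            return result
    return None

# Quartets induced by a caterpillar tree with leaves 0..5 in this order,
# then one topology altered by hand.
taxa = frozenset(range(6))
tree_like = {frozenset(q): topo(*q) for q in combinations(range(6), 4)}
noisy = dict(tree_like)
noisy[frozenset({0, 1, 2, 3})] = topo(0, 2, 1, 3)
repaired = solve(dict(noisy), taxa, 1)
```

With k = 0 the altered instance has no solution, while k = 1 suffices to
return a set of topologies without any remaining violation.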



Running Time. For establishing an upper bound on the running time, we
consider the preprocessing, the update procedure, and the size of the search
tree. The preprocessing can be done in time O(n^4), as explained in Section 3.
Updating the conflict list can be done in time O(n): By an earlier lemma,
local conflicts can only occur among three topologies consisting of no more
than five taxa. Therefore, having changed the topology of one quartet
{a, b, c, d}, we only have to examine the “neighborhood” of the quartet, i.e.,
those sets of five taxa containing a, b, c, d. For every such set of five
taxa, it can be examined in constant time whether for three topologies over
the five taxa, a new conflict emerged, or whether an existing conflict has
been resolved. Given taxa a, b, c, d, we have n − 4 choices for a fifth taxon.
Thus, O(n) is an upper bound for the update procedure.¹
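The neighborhood that has to be re-examined after a change can be listed
directly. The following Python sketch uses a hypothetical helper name and also
covers the special case of a designated taxon, in which a changed quartet that
does not contain this taxon has a single relevant five-set.

```python
def neighbourhood(quartet, taxa, fixed=None):
    """Size five sets to re-examine after the topology of `quartet` has changed."""
    quartet = frozenset(quartet)
    if fixed is not None and fixed not in quartet:
        return [quartet | {fixed}]           # only the set containing the designated taxon
    return [quartet | {e} for e in sorted(set(taxa) - quartet)]

sets_general = neighbourhood({1, 2, 3, 4}, range(10))             # n - 4 sets
sets_special = neighbourhood({1, 2, 3, 4}, range(10), fixed=0)    # one set
```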

Now, we consider the search tree size. By a careful selection of subcases to
branch into, we can find a way to make at most four recursive calls on an
arbitrarily selected local conflict, i.e., for every three topologies which
are not tree-consistent. Let t1, t2, and t3 be three topologies which are not
tree-consistent, and let, w.l.o.g., t1 = [ab|cd]. By an earlier lemma, the
topologies involve only one additional taxon, say e. By a further lemma,
t1, t2, and t3 contradict the substitution property. Given t1 = [ab|cd], the
substitution property requires topology [ab|ce] or topology [ae|cd].
Therefore, we can, w.l.o.g., assume the following setting for three quartets
contradicting the substitution property: Topology t1 = [ab|cd], topology t2 is
the topology for quartet {a, b, c, e} different from [ab|ce], and topology t3
is a topology for quartet {a, c, d, e} different from [ae|cd]. In order to
change the three topologies to satisfy the substitution property, we have the
following possibilities. We can change t1; either (1) we change t1 to [ac|bd],
or (2) we change t1 to [ad|bc]. Otherwise, we can assume that t1 is not
changed. Then, we have to (3) change t2 to [ab|ce] or (4) change t3 to
[ae|cd], because these are the only remaining possibilities to satisfy the
substitution property. Since the height of the search tree is at most k, the
preceding considerations justify an upper bound of 4^k on the exponential
growth and yield the following theorem, which summarizes our findings.

¹ In fact, as explained in Section 3, we only consider sets of five species
containing a designated taxon f. Therefore, if we change the topology of a
quartet {a, b, c, d} which does not contain the designated taxon f, then we
only have to consider one set of five taxa, namely, {a, b, c, d, f}. In this
special case, the update procedure can be done in time O(1).

Theorem 3. The MQI problem can be solved in time O(4^k · n + n^4).

Note that this running time is not only true for the algorithm reporting one
solution, but also for reporting all evolutionary trees satisfying the
requirement. Our algorithm has O(kn^4) memory requirement, where the input
size is already O(n^4). The correctness of the algorithm follows easily from
Theorem 1.
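The bound of Theorem 3 can be retraced with a short recurrence. The following
is a sketch under the assumption that every search tree node costs at most
c · n for some constant c (a constant number of conflict list updates), and
that T(k) denotes the cost of a subtree with remaining budget k.

```latex
T(0) \le c\,n, \qquad T(k) \le 4\,T(k-1) + c\,n
\;\Longrightarrow\;
T(k) \le c\,n \sum_{i=0}^{k} 4^{i} = c\,n\,\frac{4^{k+1}-1}{3} = O(4^{k} \cdot n).
```

Adding the O(n^4) preprocessing gives O(4^k · n + n^4) in total.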

6 Improving the Running Time in Practice 

Besides improving the worst case bounds on the algorithm’s running time, we 
can also extend the algorithm in order to improve the running time in practice 
without affecting the upper bounds. In this section, we collect some ideas for 
such heuristic improvements. 



Fixing Topologies. It does not make sense to change a topology which, at 
some previous level of recursion, has been altered, or for which we explicitly 
decided not to alter it. If we decide not to alter a topology in a later stage of 
recursion, we call this fixing the topology. This avoids redundant branchings in 
the search tree. 



Forcing Topologies to Change. It might be possible to identify topologies
which necessarily have to be altered in order to find a solution. We call this
forcing a topology to change. The ideas described here are similar to those
used in the so-called reduction to problem kernel for the 3-Hitting Set
problem.

Lemma 3. Consider an instance of the MQI problem in which quartet q has
topology t. If there are more than 3k distinct local conflicts which contain
t, then, in a solution for this instance, the topology for q is different
from t.

Proof. In Section 3, we showed that three topologies can only form a local
conflict if no more than five taxa occur in them. For five taxa, there are
five quartets consisting of these taxa, e.g., for taxa {a, b, c, d, e} the
quartets are {a, b, c, d}, {a, b, c, e}, {a, b, d, e}, {a, c, d, e}, and
{b, c, d, e}. Therefore, when given two quartet topologies t1 and t2, we make
the following observations. If there are more than five taxa occurring in t1
and t2, they cannot form a conflict with a third topology. If there are
exactly five taxa occurring in t1 and t2, then there are five quartets
consisting of these five taxa, two of which are the quartets for t1 and t2.
The remaining three topologies are the only possibilities for a topology that
could form a conflict with t1 and t2.

Now, consider the situation in which, for a quartet topology t, we have more
than 3k distinct local conflicts which contain t. From the preceding
discussion, we know that for any t', there are at most three topologies such
that t and t' can form a conflict with it. Consequently, there must be more
than k distinct topologies t' that occur in a local conflict with t. We show
by contradiction that we have to alter topology t to find a solution. Assume
that we can find a solution while not altering t. By changing a topology t',
we can cover at most three conflicts, since there are at most three local
conflicts containing both t and t'. Therefore, by changing k topologies, we
can resolve at most 3k local conflicts. This contradicts our assumption and
shows that we have to alter t to find a solution. □
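Lemma 3 translates into a simple counting rule. The sketch below assumes that
the conflict list is available as a collection of size three sets of quartets;
the function name forced_quartets is hypothetical.

```python
from collections import Counter

def forced_quartets(conflicts, k):
    """Quartets whose current topology occurs in more than 3k local conflicts."""
    counts = Counter(q for conflict in conflicts for q in conflict)
    return {q for q, c in counts.items() if c > 3 * k}

# Quartet "q0" takes part in four conflicts, every other quartet in one.
conflicts = [frozenset({"q0", f"x{i}", f"y{i}"}) for i in range(4)]
forced_k1 = forced_quartets(conflicts, k=1)
forced_k2 = forced_quartets(conflicts, k=2)
```

With k = 1 the threshold is three conflicts, so only the topology of q0 is
forced to change; with k = 2 no topology is forced.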



Recognizing Hopeless Situations. Now, we describe situations in which, at
some level in the search tree when we are allowed to alter at most k
topologies, we can recognize that we cannot find a solution. Thus, we can
“cut off,” i.e., omit, complete subtrees of the search tree.

Having a local conflict consisting only of fixed topologies, we obviously cannot 
resolve this conflict while not changing one of the fixed topologies. As another 
observation, we know that for a solution, we have to change the forced topologies. 
If after identifying these forced topologies, there are more than k of them, it is 
obvious that a solution is not possible — already by changing these topologies, 
we would change more topologies than we are allowed to. 

The following two lemmas contain more involved observations. Their proofs use
ideas similar to those used in the proof of Lemma 3. If a local conflict does
not contain a topology which is forced to change, then we call it an unforced
local conflict.

Lemma 4. Let us have an instance of the MQI problem in which we have
identified p topologies which are forced to change. If the number of unforced
local conflicts is greater than 3(k − p)k, then the instance has no solution.



Lemma 5. An instance of the MQI problem in which the number of local
conflicts is greater than 6(n − 4)k has no solution.
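The pruning rules of this subsection can be collected in one predicate. This
is a minimal sketch with hypothetical names; it reads p in Lemma 4 as the
number of topologies forced to change and expects the conflict list as size
three sets of quartets.

```python
def is_hopeless(conflicts, forced, fixed, n, k):
    """True if one of the cut-off rules shows that no solution with budget k exists."""
    p = len(forced)
    if p > k:                                    # more forced changes than allowed
        return True
    if any(all(q in fixed for q in conflict) for conflict in conflicts):
        return True                              # a conflict of fixed topologies only
    unforced = [c for c in conflicts if not any(q in forced for q in c)]
    if len(unforced) > 3 * (k - p) * k:          # Lemma 4
        return True
    return len(conflicts) > 6 * (n - 4) * k      # Lemma 5

one_conflict = [frozenset({1, 2, 3})]
```

Such a test would be evaluated at the start of every recursive call, before
any branching takes place.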



Clever Branching. Applying the rules described above will also significantly
improve our situation when branching. For the general branching situation on
a local conflict, we have shown in Section 5 that it is sufficient to branch
into four subcases. Regarding topologies forced to change, we can, however,
reduce the number of subcases. When we have identified a topology t which is
forced to change, it is sufficient to branch into two subcases: one for each
alternative topology of t. Regarding fixed topologies, we can take advantage
of local conflicts which contain fixed topologies. Having a local conflict
with one or two fixed topologies, we omit the subcases which change a fixed
topology. This will reduce the number of subcases to three, two, or even one
subcase.
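The effect of fixed topologies on the four subcases can be sketched as a
simple filter. The representation of a subcase as a pair (quartet, new
topology) and the name prune_fixed are assumptions made for this example.

```python
def prune_fixed(subcases, fixed):
    """Drop every subcase that would change a fixed topology."""
    return [(quartet, topology) for quartet, topology in subcases if quartet not in fixed]

# The four subcases of the general branching: two for t1, one for t2, one for t3.
subcases = [("q1", "[ac|bd]"), ("q1", "[ad|bc]"), ("q2", "[ab|ce]"), ("q3", "[ae|cd]")]
```

Fixing the quartet of t2 leaves three subcases, fixing the quartet of t1
leaves two, and fixing both leaves a single subcase.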



Preprocessing by the Q*-Method. The algorithmic improvements described above
do not sacrifice the guarantee to find the optimal solutions. Using these
improvements, we will find every solution that we would find without them.
This is not true for the following idea. We propose to use the Q*-method
described by Berry and Gascuel as a preprocessing for our algorithm. The
Q*-method produces the maximum subset of the given quartet topologies that is
tree-like. In the combined use with our algorithm, we fix these quartet
topologies from the beginning. Therefore, our algorithm will compute the
minimum number of quartet topologies we have to change in order to obtain a
tree-like set of topologies that contains the topologies fixed by the
Q*-method. The tree we obtain will be a refinement of the tree reported by the
Q*-method, which may contain unresolved branches. Thus, we cannot guarantee
that the reported tree is the optimal solution for the MQI problem. On real
data, however, it is the optimal tree with high certainty: Suppose it is not.
Then there are four taxa a, b, c, and d that are arranged in another way by
the Q*-method than they would be arranged in the optimal solution for the MQI
problem. As we are working on a complete set of topologies, this would imply
that there are at least n − 3 quartets that would make the same wrong
prediction for the arrangement of a, b, c, d: the quartet {a, b, c, d} and,
for all e ∈ S − {a, b, c, d}, one quartet over {a, b, c, d, e} that involves
e. On real data, this is very unlikely. Our experiments described in Section 8
support the conjecture that with the preprocessing by the Q*-method, we find
every solution that the MQI algorithm would find. Moreover, the experiments
show that this enhancement allows us to process much larger instances than we
could without using it.



7 Related Problems 

We now come to some variants and generalizations of the basic MQI problem
and their fixed parameter tractability. These variations arise in practice
because quartet inference methods often cannot unambiguously predict a
topology for every quartet. Perhaps the most natural generalization of MQI is
to consider weighted quartet topologies.

Weighted MQI. Weights arise since a quartet inference method can predict 
the topology for a quartet with more or less certainty. Therefore, we can assign 
weights to the quartet topologies reflecting the certainty they are predicted with. 
Given a complete set of weighted topologies Qs and a positive integer k, we 
distinguish two different questions. 

1. Assume that we are given a complete set of weighted topologies Qs, with
   positive real weights, and a positive integer k. A binary tree is a
   candidate for a solution if the set of quartet topologies induced by this
   tree differs from Qs in the topologies for at most k quartets. Can we,
   among all candidate trees satisfying this property, find the one such that
   the topologies in Qs which are not induced by the tree have minimum total
   weight?

   The algorithm in Section 5 can compute all solution trees. So, we can,
   without sacrificing the given time bounds, find among the solution trees
   the one for which the “wrong” quartet topologies have minimal total
   weight.

2. Assume that we are given a complete set of weighted topologies Qs, each
   topology having a real weight ≥ 1, and a positive real K. Is there a
   binary tree such that the quartet topologies induced by the tree differ
   from the given topologies only for topologies having total weight less
   than K?

   Again, we can use the algorithm presented in Section 5. When branching
   into different subcases, the time analysis of the algorithm relied on the
   fact that in each subcase at least one quartet topology is changed, i.e.,
   added to the “wrong” topologies. In the current situation of weighted
   topologies with weights ≥ 1, each subcase changes quartet topologies
   having a total weight of at least 1. The time analysis of our algorithm
   is, therefore, still valid and the time bounds remain the same.
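For the first question, the selection step after the search can be sketched as
follows. The sketch assumes that the solutions have already been enumerated
and are given as mappings from quartets to topologies; the name
cheapest_solution and the toy data are illustrative only.

```python
def cheapest_solution(original, weights, solutions):
    """Among the solutions, pick the one whose changed topologies weigh least in total."""
    def changed_weight(solution):
        return sum(weights[q] for q in original if solution[q] != original[q])
    return min(solutions, key=changed_weight)

original = {"q1": "A", "q2": "A", "q3": "A"}
weights = {"q1": 5.0, "q2": 1.0, "q3": 1.5}
s1 = {"q1": "B", "q2": "A", "q3": "A"}    # changes one topology of weight 5.0
s2 = {"q1": "A", "q2": "B", "q3": "B"}    # changes two topologies of total weight 2.5
best = cheapest_solution(original, weights, [s1, s2])
```

In the toy data, the solution that changes more topologies is nevertheless
preferred, because the changed topologies carry less total weight.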

Allowing arbitrarily small weights in question 2, the problem cannot be fixed
parameter tractable, unless P = NP. To see this, take an instance of
unweighted MQI with parameter k. We can turn this instance into an instance of
weighted MQI by assigning all topologies weight 1/k and setting the parameter
to 1. A fixed parameter algorithm for the problem with arbitrary weights > 0
would thus give a polynomial time solution for MQI, which contradicts the
NP-completeness of MQI unless P = NP. Having, however, weights of size at
least ε for some positive real ε, the problem is fixed parameter tractable as
we described here for the special case that ε = 1 (similar to Weighted Vertex
Cover).

Underspecified MQI. Due to lack of information or due to ambiguous results,
a quartet inference method may not be able to compute a topology for every
quartet, so there may be quartets for which no topology is given. Assuming a
bounded number of quartets with missing topology, we formulate the problem as
follows. Given a set S of taxa, integers k and k', and a set of topologies
Qs, such that Qs contains quartet topologies for all quartets over S except
for k' many. Then, we ask whether there is a binary tree such that the quartet
topologies induced by the tree differ from the given topologies only for k
topologies.

The set of topologies is “underspecified” by k' topologies. We can solve the
problem as follows. Having three possible topologies for each quartet, we can,
for a quartet without given topology, branch into three subcases, one for each
of its three possible topologies. Having selected a topology for each such
quartet, we run the algorithm from Section 5. The resulting algorithm has time
complexity O(3^k' · 4^k · n + n^4) and shows that the problem is fixed
parameter tractable for parameters k and k'. Note that for unbounded k' this
problem is NP-complete even for k = 0 and, therefore, is not fixed parameter
tractable.

We only briefly mention another variant of MQI, Overspecified MQI: In that
problem, we are, compared to MQI, given an additional integer k'' and two
topologies instead of one for k'' many quartets. For these quartets, we are
free to choose one of the given topologies. In a similar way as for
underspecified MQI, we can show that overspecified MQI is fixed parameter
tractable for parameters k and k''.

8 Experimental Evaluation 

To investigate the usefulness and practical relevance of the algorithm for
unweighted MQI, we performed experiments on artificial as well as on real data
from fungi. The implementation of the algorithm was done using the programming
language C. The algorithm contains the enhancements described in Section 6.
The combined use with the Q*-method was, however, only applied when processing
the fungi data, not when processing the artificial data. The reported tests
were done on a LINUX PC with a Pentium III 750 MHz processor and 192 MB main
memory.

8.1 Artificial Data 

We performed experiments on artificially generated data in order to find out 
which kind of data sets our algorithm can be especially useful for. For a given 
number n of taxa and parameter k, we produce a data file as follows. We generate 
a random evolutionary tree for n taxa and derive the quartet topologies from that 
tree. Then, we change k distinct, arbitrarily selected topologies in a randomly 
chosen way. This results in an MQI instance that certainly can be solved with 
parameter k. For each pair of values for n and k, ten different data sets were 
created. The reported results are the average for test runs on ten data sets. 
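The generation of such an instance can be sketched in Python as follows. This
is not the generator used for the reported experiments, whose details are not
given; the random tree is grown here by repeatedly subdividing a random edge,
the quartet topologies are read off from path lengths in the tree, and all
names are assumptions of this sketch.

```python
import random
from collections import deque
from itertools import combinations, count

def random_binary_tree(n, rng):
    """Adjacency lists of a random unrooted binary tree with leaves 0..n-1 (n >= 3)."""
    ids = count(n)                           # internal nodes are numbered n, n+1, ...
    centre = next(ids)
    adj = {centre: [0, 1, 2], 0: [centre], 1: [centre], 2: [centre]}
    edges = [(0, centre), (1, centre), (2, centre)]
    for leaf in range(3, n):
        u, v = edges.pop(rng.randrange(len(edges)))   # edge to subdivide
        w = next(ids)
        adj[u][adj[u].index(v)] = w
        adj[v][adj[v].index(u)] = w
        adj[w] = [u, v, leaf]
        adj[leaf] = [w]
        edges += [(u, w), (w, v), (leaf, w)]
    return adj

def leaf_distances(adj, n):
    """Breadth-first search from every leaf; dist[s][x] counts the edges from s to x."""
    dist = {}
    for s in range(n):
        seen, queue = {s: 0}, deque([s])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if y not in seen:
                    seen[y] = seen[x] + 1
                    queue.append(y)
        dist[s] = seen
    return dist

def pairings(quartet):
    """The three possible topologies of a quartet as pairs of pairs."""
    a, b, c, d = quartet
    return [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]

def induced_topology(dist, quartet):
    """The pairing with the smallest distance sum is the topology induced by the tree."""
    return min(pairings(quartet),
               key=lambda p: dist[p[0][0]][p[0][1]] + dist[p[1][0]][p[1][1]])

def make_instance(n, k, seed=0):
    """Quartet topologies of a random tree, with k distinct topologies altered at random."""
    rng = random.Random(seed)
    dist = leaf_distances(random_binary_tree(n, rng), n)
    quartets = list(combinations(range(n), 4))
    original = {q: induced_topology(dist, q) for q in quartets}
    noisy = dict(original)
    for q in rng.sample(quartets, k):
        noisy[q] = rng.choice([t for t in pairings(q) if t != original[q]])
    return original, noisy

original, noisy = make_instance(n=7, k=3, seed=1)
```

By construction, the altered set differs from a tree-like set in exactly k
topologies, so the instance is solvable with parameter k.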

We experimented with different values of n and k. As a measure of
performance, we use two values: We report the processing time and, since
processing time is heavily influenced by system conditions, e.g., memory
access time in case of cache faults, also the search tree size. The search
tree size is the number of the search tree’s nodes, both inner nodes and
leaves, and it reflects the exponential growth of the algorithm’s running
time.

Figure 2(a) gives a table of results for different values of n and k.
Regarding the processing time, we note, on the one hand, the increasing time
for fixed n and growing k. On the other hand, we observe that for moderate
values of k, we can process large instances of the problem, e.g., n = 50 and
k = 100 in 40 minutes. For comparison of the algorithm’s performance, consider
the results reported by Ben-Dor et al., who solve MQI instances also giving
guaranteed optimal results. They only report about processing up to 20 taxa
and list, admittedly for a high number of erroneous topologies, a running time
of 128 hours for this case (on a SUN Ultra-4 with 300 MHz).

In Figure 2(b), we compare, on a logarithmic scale, the theoretical upper
bound of 4^k to the real size of the search tree. For each fixed number of
taxa n, we give a graph displaying the growth of search tree size for
increasing k. The search trees are, by far, smaller than the 4^k bound. This
is mainly due to the practical improvements of the algorithm (see Section 6).
We also note that for equal value of k, a higher number n of taxa often
results in a smaller search tree.

Fig. 2. Comparing running time and search tree size for different values of n
and k.



8.2 Real Data 

Using our algorithm, we analyzed the evolutionary relationships of species
from the mushroom genus Amanita, a group that includes well-known species like
the Fly Agaric and the Death Cap. The underlying data are an alignment of
nuclear DNA sequences coding for the D1/D2 region of the ribosomal large
subunit (alignment length 576) from Amanita species and one outgroup taxon, as
used by Weiß et al. We inferred the quartet topologies by (1) using dnadist
from the Phylip package to compute pairwise distances with the maximum
likelihood metric, and (2) using distquart from the Phyloquart package to
infer quartet topologies based on the distances.

The analysis was done by a preprocessing of the data using the Q*-method,
also taken from the Phyloquart package. Experiments on small instances, e.g.,
10 taxa, show that all solutions we find without using the Q*-method are also
found when using it. Using the Q*-method, however, results in a significant
speed-up of the processing. Figure 3(a) shows this impact for small numbers of
Amanita species. Note, however, that the speed-up heavily depends on the data.
In Figure 3(a) and in the following, we neglect the time needed for the
preprocessing by the Q*-method, which is, e.g., 0.11 seconds for n = 12.

We processed a set of n = 22 taxa in 35 minutes. The resulting tree was
rooted using the outgroup taxon Limacella glioderma and is displayed in
Figure 3(b). We found the best solution for k = 979 for the given 7315 quartet
topologies. The Q*-method had fixed 41 percent of the quartet topologies in
advance. Considering the tree, the grouping of taxa is consistent with the
grouping into seven sections supported by Weiß et al., who used the distance
method neighbor joining, heuristic parsimony methods, and maximum likelihood
estimations. In particular, our grouping is nearly identical to the topology
revealed by Weiß et al. using maximum likelihood estimation. This topology is
well compatible with classification concepts based on morphological
characters, e.g., the sister group relationship of sections Vaginatae and
Caesareae, and the monophyly of subgenus Amanita.

(a)        running time in sec.
    n      no Q*       with Q* (percentage of quartet topologies fixed by Q*)
    8         0.46     0.36 (21%)
    9         3.41     0.85 (32%)
   10        35.96     2.68 (38%)
   11       617.56     4.11 (41%)
   12      7039.82     5.44 (43%)

(b) [Tree drawing not reproduced. Leaves, in the order shown: A. fulva,
    A. nivalis, A. vaginata, A. ceciliae, A. caesarea, A. longistriata,
    A. incarnatifolia, A. mira, A. gemmata, A. pantherina, A. muscaria,
    A. solitaria, A. japonica, A. fuliginea, A. subjunquillea, A. phalloides,
    A. excelsa, A. citrina, A. avellaneosquamosa, A. volvata,
    A. clarisquamosa, Limacella glioderma.]

Fig. 3. (a) Speed-up when using Q* preprocessing. (b) Optimal tree found for a
set of 21 Amanita species and one outgroup taxon; indicated is the grouping of
Amanita species into 7 sections and 2 subgenera.

One might hope that the quality of quartet inference techniques will improve
in the future. This would lead to instances requiring smaller values of k.

9 Conclusion 

We showed that the Minimum Quartet Inconsistency problem can be solved in
worst case time O(4^k · n + n^4) when parameter k is the number of faulty
quartet topologies. This means that the problem is fixed parameter tractable.
Several ideas for tuning the algorithm show that the practical performance of
the algorithm is much better than the theoretical bound given above. This is
clearly expressed by our experimental results. Note that there is an ongoing
discussion about the usefulness of quartet methods: St. John et al. give a
rather critical exposition of the practical performance of quartet methods (in
particular, quartet puzzling) in comparison with the neighbor joining method,
which is in opposition to results reported by Strimmer and v. Haeseler.

Concerning future work, we want to extend our experiments to weighted
quartet topologies and to other data. Also, the fact that we can obtain all
optimal and near-optimal solutions, and the usefulness of this, deserves
further investigation. From a parameterized complexity point of view, it
remains an open question to find a so-called reduction to problem kernel.
The further reduction of the search tree size, concerning theoretical as well
as experimental bounds, is a worthwhile future challenge.



Abstract. Lattice protein models are used for hierarchical approaches
to protein structure prediction, as well as for investigating principles of
protein folding. The problem is that there is so far no known lattice that
can model real protein conformations with good quality, and for which
there is an efficient method to prove whether a conformation found by
some heuristic algorithm is optimal. We present such a method for the
FCC-HP-Model. For the FCC-HP-Model, we need to find conformations with a
maximally compact hydrophobic core. Our method allows us to enumerate
maximally compact hydrophobic cores for a sufficiently large number of
hydrophobic amino acids. We have used our method to prove the optimality of
heuristically predicted structures for HP-sequences in the FCC-HP-Model.



1 Introduction 

Protein structure prediction is one of the most important unsolved problems
of computational biology. It can be specified as follows: Given a protein
by its sequence of amino acids, what is its native structure? NP-completeness
of the problem has been proven for many different models (including lattice
and off-lattice models). These results strongly suggest that the protein
folding problem is NP-hard in general. Therefore, it is unlikely that a
general, efficient algorithm for solving this problem can be given. Actually,
the situation is even worse, since the general principles why natural proteins
fold into a native structure are unknown. This is cumbersome since rational
design is commonly viewed to be of paramount importance, e.g., for drug
design, where one faces the difficulty to design proteins that have a unique
and stable native structure.

To tackle structure prediction and related problems, simplified models have
been introduced. They are used in hierarchical approaches for protein folding
(see also the meeting review of CASP3, where some groups have used lattice
models). Furthermore, they have become a major tool for investigating general
properties of protein folding.

Most important are the so-called lattice models. The simplifications
commonly used in this class of models are: 1) monomers (or residues) are
represented using a unified size, 2) bond length is unified, 3) the positions
of the monomers are restricted to lattice positions, and 4) a simplified
energy function is used.

* Supported by the PhD programme “Graduiertenkolleg Logik in der Informatik”
(GKLI) of the “Deutsche Forschungsgemeinschaft” (DFG).

In the literature, many different lattice models (i.e., lattices and energy
functions) have been used. Examples of how such models can be used for
predicting the native structure or for investigating principles of protein
folding have been given. Of course, the question arises which lattice and
energy function have to be preferred. There are two (somewhat conflicting)
aspects that have to be evaluated when choosing a model: 1) the accuracy of
the lattice in approximating real protein conformations, and the ability of
the energy function to discriminate native from non-native conformations, and
2) the availability and quality of search algorithms for finding minimal (or
nearly minimal) energy conformations.

While the first aspect is well investigated in the literature (e.g., [?]), the second aspect is underrepresented. By and large, there are mainly two different heuristic search approaches used in the literature: 1) Ad hoc restriction of the search space to compact or quasi-compact conformations (a good example is [?], where the search space is restricted to conformations forming an n × n × n cube). The main drawback here is that the restriction to compact conformations is not biologically motivated for a complete amino acid sequence (as done in these approaches), but only for the hydrophobic amino acids. In consequence, the restriction either has to be relaxed, which leads to an inefficient algorithm, or is chosen too strong and thus may exclude optimal conformations. 2) Stochastic sampling such as Monte Carlo methods with simulated annealing, genetic algorithms, etc. Here, the degree of optimality of the best conformations and the quality of the sampling cannot be determined by state-of-the-art methods.^1

On the other hand, there are only three exact algorithms known [?] which are able to enumerate minimal (or nearly minimal) energy conformations, all for the cubic lattice. However, the ability of this lattice to approximate real protein conformations is poor. For example, [?] pointed out in particular the parity problem of the cubic lattice: two monomers whose chain positions have the same parity can never form a contact.
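The parity problem can be checked empirically with a minimal sketch. It assumes the simple cubic lattice Z^3 and samples self-avoiding walks by naive restarts; the function names `random_self_avoiding_walk` and `contacts` are illustrative and not taken from the paper.

```python
import random

# The six unit steps of the simple cubic lattice Z^3.
STEPS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def random_self_avoiding_walk(length, rng, max_tries=10000):
    """Sample a self-avoiding walk with `length` monomers by naive restarts."""
    for _ in range(max_tries):
        walk = [(0, 0, 0)]
        occupied = {walk[0]}
        while len(walk) < length:
            x, y, z = walk[-1]
            free = [(x + dx, y + dy, z + dz) for dx, dy, dz in STEPS
                    if (x + dx, y + dy, z + dz) not in occupied]
            if not free:  # dead end: discard this attempt and restart
                break
            walk.append(rng.choice(free))
            occupied.add(walk[-1])
        if len(walk) == length:
            return walk
    raise RuntimeError("no self-avoiding walk found")


def contacts(walk):
    """Pairs (i, j) of non-consecutive chain positions on neighboring sites."""
    return [(i, j)
            for i in range(len(walk))
            for j in range(i + 2, len(walk))
            if sum(abs(u - v) for u, v in zip(walk[i], walk[j])) == 1]
```

Every contact pair (i, j) returned by `contacts` has an odd index difference, because each step of the walk flips the parity of x + y + z and neighboring sites always differ in that parity.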

In this paper, we follow the proposal by [?] to use a lattice model with a simple energy function, namely the HP (hydrophobic-polar) model, but on a better suited lattice (namely the face-centered cubic lattice, FCC). There are two reasons for this approach:

1) The FCC can model real protein conformations with good quality (see [?], where it was shown that FCC can model protein conformations with coordinate root mean square deviation below 2 Å).

2) The HP-model models the important aspect of hydrophobicity. Essentially, it is a polymer chain representation (on a lattice) with one stabilizing interaction each time two hydrophobic residues have unit distance. This enforces compactification, while polar residues and solvent are not explicitly regarded. It follows the assumption that the hydrophobic effect determines the overall configuration of a protein (for a definition of the HP-model, see [?]).

^1 Although there are mathematical treatments of Monte Carlo methods with simulated annealing, the partition function of the ensemble (which is needed for a precise statement) is in general unknown.

Once a search algorithm for minimal energy conformations is established for this FCC-HP-model, one can employ it as a filter step in a hierarchical approach. This way, one can improve the energy function to achieve better biological relevance and go on to model amino acid positions more accurately.

Contribution of the Paper. In this paper, we present the first algorithm for enu- 
merating maximal compact hydrophobic cores in the face-centered cubic lattice. 
For a given conformation of the FCC-HP-model, the hydrophobic core is the set of all positions occupied by hydrophobic (H) monomers. A hydrophobic core is maximally compact if the number of contacts between neighboring positions is maximized. Thus, a conformation which has a maximally compact hydrophobic core has minimal energy in the HP-model.

There are mainly two applications of the algorithm for finding hydrophobic cores. The first is that it provides a method to check minimality of conformations found by a heuristic algorithm. We have used a heuristic algorithm described earlier [?]. For the first time, we were able to find minimal energy conformations (and to prove their optimality) for HP-sequences in the FCC-HP-model. So far, the only known results for the FCC-HP-model were approximation results with a guaranteed ratio of 60% ([?]; [?] provides a general approximation scheme for HP-models on arbitrary lattices; [?] gives an approximation scheme for the HP-model on the cubic lattice).

The second application is that the hydrophobic cores are a promising intermediate step for an algorithm to enumerate all minimal energy conformations. This technique has already been used successfully in [?].



2 Preliminaries 



For a vector $p$, we denote by $p_x$ (resp. $p_y$ or $p_z$) its $x$-coordinate (resp. $y$- or $z$-coordinate). We use a transformed representation of the FCC-lattice (for a detailed description, see [?]). We define the FCC-isomorphic lattice $D_3'$ to be the lattice that consists of the following sets of points:

\[ D_3' = \{ (x, y, z) \mid (x, y, z) \in \mathbb{Z}^3 \text{ and } x \text{ even} \} \uplus \{ (x, y + 0.5, z + 0.5) \mid (x, y, z) \in \mathbb{Z}^3 \text{ and } x \text{ odd} \}. \]

The first set consists of the points in even $x$-layers, the second of the points in odd $x$-layers. The set $N_{D_3'}$ of minimal vectors connecting neighbors in $D_3'$ is given by

\[ N_{D_3'} = \{ (0, \pm 1, 0), (0, 0, \pm 1) \} \uplus \{ (\pm 1, \pm 0.5, \pm 0.5) \}. \]

The vectors in the second set are the vectors connecting neighbors in two successive $x$-layers. Two points $p$ and $p'$ in $D_3'$ are neighbors if $p - p' \in N_{D_3'}$.

A coloring is a function $f : D_3' \to \{0, 1\}$, where $f^{-1}(1) \neq \emptyset$. We will identify a coloring $f$ with the set of all points colored by $f$, i.e. $\{p \mid f(p) = 1\}$. Hence, for colorings $f_1, f_2$ we will use standard set notation for size $|f_1|$, union $f_1 \cup f_2$, disjoint union $f_1 \uplus f_2$, and intersection $f_1 \cap f_2$. Given a coloring $f$, we define the number of contacts of $f$ by $\mathrm{con}(f) := \frac{1}{2} |\{(p, p') \mid f(p) \wedge f(p') \wedge (p - p') \in N_{D_3'}\}|$.

A coloring $f$ is called a coloring of the plane $x = c$ if $f(x, y, z) = 1$ implies $x = c$. We say that $f$ is a plane coloring if there is a $c$ such that $f$ is a coloring of plane $x = c$. We define $\mathrm{Surf}_{pl}(f)$ to be the surface of $f$ in the plane $x = c$, i.e. $\mathrm{Surf}_{pl}(f) = |\{(p, p') \mid (p - p') \in N_{D_3'} \wedge f(p) \wedge \neg f(p') \wedge p'_x = c\}|$. With $\min_x(f)$ we denote the integer $\min\{p_x \mid p \in f\}$. $\max_x(f)$, $\min_y(f)$, $\max_y(f)$, $\min_z(f)$ and $\max_z(f)$ are defined analogously.
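The lattice and the contact count can be sketched in a few lines. As an assumption made only for this sketch, the y- and z-coordinates are doubled so that the half-integer shifts of odd x-layers become integers: the in-layer vectors become (0, ±2, 0) and (0, 0, ±2), and the interlayer vectors become (±1, ±1, ±1). The names `lattice_point` and `con` are illustrative.

```python
from itertools import product

# Neighbor vectors in doubled (y, z) coordinates:
# 4 in-layer vectors plus 8 vectors between successive x-layers.
NEIGHBOR_VECTORS = (
    [(0, 2, 0), (0, -2, 0), (0, 0, 2), (0, 0, -2)]
    + list(product((1, -1), repeat=3))
)


def lattice_point(x, y, z):
    """Map integers (x, y, z) to the doubled representation of a lattice point.

    Odd x-layers are shifted by 0.5 in y and z, i.e. by 1 after doubling.
    """
    shift = x % 2
    return (x, 2 * y + shift, 2 * z + shift)


def con(coloring):
    """Number of contacts: unordered pairs of colored neighboring points."""
    ordered_pairs = sum(
        1
        for (x, y, z) in coloring
        for (dx, dy, dz) in NEIGHBOR_VECTORS
        if (x + dx, y + dy, z + dz) in coloring
    )
    return ordered_pairs // 2  # every contact was counted from both ends
```

For instance, two neighboring points in layer x = 0 together with two neighboring points in layer x = 1 above them form a tetrahedron, in which all six pairs are in contact.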



3 Description of the Method 



Our aim is to determine maximally compact hydrophobic cores. A hydrophobic core is just a coloring $f$. A maximally compact hydrophobic core for $n$ points is a coloring $f$ of $n$ points that maximizes $\mathrm{con}(f)$. Without loss of generality, we can assume that $\min_x(f) = 1$. Let $k = \max_x(f)$. Then, we partition $f$ into plane colorings $f_1, \ldots, f_k$ of the layers $x = 1, \ldots, x = k$. To search for a maximal coloring $f$, we do a branch-and-bound search on $k$ and $f_1, \ldots, f_k$.

Of course, the problem is to give good bounds that allow us to cut off many $k$ and $f_1, \ldots, f_k$ that will not maximize $\mathrm{con}(f_1 \uplus \ldots \uplus f_k)$. For this purpose, we distinguish between contacts in a single layer ($= \mathrm{con}(f_i)$ for $1 \le i \le k$), and interlayer contacts $\mathrm{IC}_{f_i}^{f_{i+1}}$ for $1 \le i < k$ between two successive layers (i.e., pairs $(p, p')$ such that $p$ and $p'$ are neighbors, $p \in f_i$ and $p' \in f_{i+1}$). We then give two different bounds on the layer and interlayer contacts, given some parameters restricting the $f_i$'s.

For every plane coloring $f_i$, these parameters are the size $n_i$ of $f_i$, the number $a_i$ of rows that contain a point of $f_i$, and the number $b_i$ of columns that contain a point of $f_i$. Given these parameters, it is known that the layer contacts of $f_i$ are given by $2n_i - a_i - b_i$. In this paper, we present for any set of parameters $n_i, a_i, b_i$ and $n_{i+1}$ an upper bound $B^{n_{i+1}}_{n_i,a_i,b_i}$ on the number of interlayer contacts:

\[ B^{n_{i+1}}_{n_i,a_i,b_i} \ge \max\left\{ \mathrm{IC}_{f_i}^{f_{i+1}} \;\middle|\; f_i \text{ satisfies } n_i, a_i, b_i \text{ and } |f_{i+1}| = n_{i+1} \right\}. \]



So far, the only related bound was given in our own work [?]. Although a bound $B^{n_{i+1}}_{n_i,a_i,b_i}$ was given there, it does not hold for arbitrary sets of parameters $n_i, a_i, b_i$ and $n_{i+1}$. Instead, that bound is valid only for sufficiently filled plane colorings (called normal), which was sufficient for the purpose of [?].

The bound $B^{n_{i+1}}_{n_i,a_i,b_i}$ is used in searching for a maximally compact core for $n$ H-monomers as follows. Instead of directly enumerating $k$ and all possible colorings $f_1 \uplus \ldots \uplus f_k$, we search through all possible sequences of parameters $((n_1, a_1, b_1), \ldots, (n_k, a_k, b_k))$ with the property that $n = \sum_i n_i$. By using the bound $B^{n_{i+1}}_{n_i,a_i,b_i}$, only a few layer sequences have to be considered further. For these optimal layer sequences, we then search for all admissible colorings $f_1 \uplus \ldots \uplus f_k$.

For calculating the bound $B^{n_{i+1}}_{n_i,a_i,b_i}$, we need to introduce additional parameters, namely the number of non-overlapping and unconnected rows in layer $x = i$. These additional parameters allow us to determine the maximal number of interlayer contacts between layers $x = i$ and $x = i + 1$. Further note that only few combinations of $(n_i, a_i, b_i)$ and these additional parameters are admissible. Thus, for every $(n_i, a_i, b_i)$, we search through all admissible numbers of non-overlapping rows in layer $x = i$ to determine $B^{n_{i+1}}_{n_i,a_i,b_i}$.

In Section 4, we define the parameters of a plane coloring and determine which combinations of parameters are admissible. In Section 5, the number of interlayer contacts is given, provided that the parameters and the number of points with three interlayer contacts, called 3-points, are fixed. In the following section, we determine the number of 3-points that maximizes the interlayer contacts.

4 Properties of Overlapping and Non-overlapping 
Colorings 

Let $f$ be a coloring of plane $x = c$. A horizontal caveat in $f$ is a $k$-tuple of points $(p_1, \ldots, p_k)$ such that $\forall 1 \le j < k : (p_{j+1} - p_j)_y = 1$, $\{p_1, p_k\} \subseteq f$ and $\forall 1 < j < k : p_j \notin f$. A vertical caveat in $f$ is defined analogously, satisfying $\forall 1 \le j < k : (p_{j+1} - p_j)_z = 1$ instead. We say that $f$ contains a caveat if there is at least one horizontal or vertical caveat in $f$. $f$ is called caveat-free if it does not contain a caveat. We will handle only caveat-free colorings. The methods can be extended to treat caveats as well, but we omit this for simplicity.

We now introduce the parameters of a plane coloring $f$ that will allow us to determine layer contacts and to bound interlayer contacts. The first set of parameters are the rows and columns occupied by $f$. For an arbitrary plane coloring $f$ of $x = c$, define $\mathrm{occz}(f, z) := \exists y : f(c, y, z)$ and $\mathrm{occy}(f, y) := \exists z : f(c, y, z)$. Furthermore, we define $\mathrm{oylines}(f) := |\{y \mid \mathrm{occy}(f, y)\}|$ and $\mathrm{ozlines}(f) := |\{z \mid \mathrm{occz}(f, z)\}|$. For notational convenience, define $\mathrm{olines}(f) := (\mathrm{oylines}(f), \mathrm{ozlines}(f))$. For a coloring $f$, we call rows $z$ where $\mathrm{occz}(f, z)$ holds, and columns $y$ where $\mathrm{occy}(f, y)$ holds, occupied, and unoccupied otherwise.

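For a plane coloring $f$, we define the layer contacts $\mathrm{LC}_f$ to be $\mathrm{con}(f)$. We define

\[ \mathrm{LC}_{n,a,b} := \max\left\{ \mathrm{LC}_f \;\middle|\; f \text{ is a coloring of plane } x = c \,\wedge\, f \text{ has lines } (a, b) \,\wedge\, |f| = n \right\}. \]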



Proposition 1. For every caveat-free coloring $f$ with $\mathrm{olines}(f) = (a, b)$, we get $\mathrm{LC}_{n,a,b} = 2n - \frac{1}{2}\mathrm{Surf}_{pl}(f)$ and $\mathrm{Surf}_{pl}(f) = 2(a + b)$.

Proof (Sketch). Each of the $n$ points colored by $f$ has 4 neighbors in the plane, each of which is either another colored point or a surface point. Hence, $4n = 2\,\mathrm{LC}_{n,a,b} + \mathrm{Surf}_{pl}(f)$. For the second claim, note that by definition, every occupied row and column must generate 2 surface contacts, and, since $f$ is caveat-free, there can be no more than 2. □
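Combining the two claims of Proposition 1 gives 2n − a − b layer contacts for a caveat-free plane coloring. The following sketch checks this on small examples. It represents a plane coloring as a set of integer (y, z) pairs, which is an assumption of the sketch (the x-coordinate is dropped), and the helper names are illustrative.

```python
IN_PLANE_STEPS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def olines(points):
    """(number of occupied columns y, number of occupied rows z)."""
    return len({y for y, _ in points}), len({z for _, z in points})


def layer_contacts(points):
    """Contacts inside the plane, each unordered pair counted once."""
    return sum(((y + 1, z) in points) + ((y, z + 1) in points)
               for y, z in points)


def surface(points):
    """Pairs (colored point, uncolored in-plane neighbor)."""
    return sum((y + dy, z + dz) not in points
               for y, z in points
               for dy, dz in IN_PLANE_STEPS)


def layer_contacts_formula(points):
    """Closed form 2n - a - b, valid for caveat-free colorings only."""
    a, b = olines(points)
    return 2 * len(points) - a - b
```

For the coloring {(0, 0), (2, 0)}, which has a horizontal caveat, the closed form gives 1 although there is no contact, so the caveat-free assumption is needed.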

The second set of parameters are the numbers of unconnected and non-overlapping rows. Let $f$ be a coloring of plane $x = c$. We define a row $z$ to be non-overlapping in $f$ if $z$ is occupied, there is an occupied row $z' > z$, and there is no $y$ such that $f(c, y, z) \wedge f(c, y, z + 1)$. A row $z$ is called unconnected if it is non-overlapping and there are no $y, y'$ with $f(c, y, z) \wedge f(c, y', z + 1) \wedge |y - y'| \le 1$. The number of non-overlapping rows is denoted by $\#\text{non-overlaps}(f)$ and the number of unconnected rows by $\#\text{non-connects}(f)$.



Fig. 1. a) Non-overlapping vs. b) unconnected. (Panel labels: a) “Distance 1”, b) “Distance > 1”.)

To illustrate the terms, Figure 1a) shows a coloring with $\#\text{non-overlaps}(f) = 1$ and $\#\text{non-connects}(f) = 0$, whereas the coloring in Figure 1b) satisfies $\#\text{non-overlaps}(f) = 1$ and $\#\text{non-connects}(f) = 1$.

We will call a coloring $f$ with $\#\text{non-overlaps}(f) = 0$ overlapping (otherwise non-overlapping). A coloring with $\#\text{non-connects}(f) = 0$ is called connected (otherwise unconnected).
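The two row parameters can be computed directly from their definitions. The sketch below again assumes a plane coloring given as a set of integer (y, z) pairs and reads the connectivity condition as |y − y'| ≤ 1; the function names are illustrative.

```python
def row(points, z):
    """The y-coordinates of the colored points in row z."""
    return [y for (y, zz) in points if zz == z]


def non_overlapping_rows(points):
    """Occupied rows z below the top row with no point directly above in row z + 1."""
    rows = sorted({z for _, z in points})
    top = rows[-1]
    return [z for z in rows
            if z < top and not any((y, z + 1) in points for y in row(points, z))]


def unconnected_rows(points):
    """Non-overlapping rows whose points are all at y-distance > 1 from row z + 1."""
    return [z for z in non_overlapping_rows(points)
            if not any(abs(y - y2) <= 1
                       for y in row(points, z)
                       for y2 in row(points, z + 1))]
```

The number of non-overlapping rows is then `len(non_overlapping_rows(f))`, and the number of unconnected rows is `len(unconnected_rows(f))`.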

In the rest of this section, we give precise bounds on the number of colored points, given the parameters of the plane coloring. We will first state some properties of colorings with respect to $\mathrm{olines}(f)$, $\#\text{non-overlaps}(f)$ and $\#\text{non-connects}(f)$.

Proposition 2. For every caveat-free coloring $f$ we have $|f| \ge \max(\mathrm{olines}(f))$.

Since by definition the maximal occupied row $z$ cannot be non-overlapping, we immediately get that $\#\text{non-overlaps}(f)$ is less than $\mathrm{oylines}(f)$. The next lemma states in addition that $\#\text{non-overlaps}(f)$ is less than $\mathrm{ozlines}(f)$. Intuitively, this is a consequence of the (non-trivial) fact that every non-overlapping row produces exactly one non-overlapping column.

Lemma 1. For a caveat-free coloring $f$, we get

\[ \#\text{non-overlaps}(f) < \min(\mathrm{olines}(f)). \]

A caveat-free coloring can be split at non-overlapping rows into sub-colorings with the nice property that the parameters of the coloring can be calculated from the sub-colorings in a simple way. This fact will be employed for inductive arguments. Given a plane coloring $f$ and a row $\min_z(f) \le z_s < \max_z(f)$, we define $f_{\odot z_s} = \{(c, y, z) \in f \mid z \odot z_s\}$ for $\odot \in \{\le, >\}$. Note that the restriction on $z_s$ is required, since splitting at row $z_s = \max_z(f)$ would produce an empty sub-coloring $f_{> z_s}$. Further note that this restriction is trivially satisfied by any non-overlapping row.

Lemma 2 (Split). Let $f$ be a caveat-free coloring of the plane $x = c$ with $\#\text{non-overlaps}(f) \ge 1$, and let $z_s$ be a non-overlapping row. Then,

1. $f = f_{\le z_s} \uplus f_{> z_s}$ and the sub-colorings $f_{\le z_s}$ and $f_{> z_s}$ are caveat-free

2. $\mathrm{olines}(f) = (\mathrm{oylines}(f_{\le z_s}) + \mathrm{oylines}(f_{> z_s}),\ \mathrm{ozlines}(f_{\le z_s}) + \mathrm{ozlines}(f_{> z_s}))$

3. $\#\text{non-overlaps}(f) = \#\text{non-overlaps}(f_{\le z_s}) + \#\text{non-overlaps}(f_{> z_s}) + 1$.



Fig. 2. Coloring with maximal number of elements.



There is a dependency between the admissible numbers $\#\text{non-overlaps}(f)$ and the number of elements in a coloring $f$, given the number of occupied lines in $y$ and $z$ direction. Think of $(a, b)$ (resp. $m_{no}$) as representing $\mathrm{olines}(f)$ (resp. $\#\text{non-overlaps}(f)$). We define $n_{\max}(a, b, m_{no}) := m_{no} + (a - m_{no})(b - m_{no})$ and $n_{\min}(a, b, m_{no}) := a + b - 1 - m_{no}$. The idea of the definition of $n_{\max}(a, b, m_{no})$ is that the number of elements is maximized if we have one big overlapping region and waste as little space as possible for the non-overlapping region. Hence, in this maximal coloring, all of the non-overlapping rows contain exactly one point. Such a coloring is shown in Figure 2.

Lemma 3. All caveat-free colorings $f$ satisfy $|f| \le n_{\max}(a, b, m_{no})$, where $m_{no} = \#\text{non-overlaps}(f)$ and $(a, b) = \mathrm{olines}(f)$.

Lemma 4. For all caveat-free colorings $f$, it holds that $n_{\min}(a, b, m_{no}) \le |f|$, where $(a, b) = \mathrm{olines}(f)$ and $m_{no} = \#\text{non-overlaps}(f)$.

Proof (Sketch). For the case $m_{no} = 0$, a coloring $f$ of plane $x = c$ with minimal number of points and $\mathrm{olines}(f) = (a, b)$ is given by the coloring that has $b$ points $(c, 1, 1), \ldots, (c, 1, b)$ in the column $y = 1$, and $a$ points $(c, 1, b), \ldots, (c, a, b)$ in the row $z = b$. Clearly, $f$ has $a + b - 1$ points, since $(c, 1, b)$ is in both the first column and the last row. For $m_{no} > 0$, the claim follows by induction using the split lemma (Lemma 2). □

For convenience, we define the following bounds on the number of non-overlapping rows:

\[ \mathrm{no}_{\min}(n, a, b) := \min\{ m_{no} \mid 0 \le m_{no} \le \min(a, b) - 1 \wedge n \ge n_{\min}(a, b, m_{no}) \} \]
\[ \mathrm{no}_{\max}(n, a, b) := \max\{ m_{no} \mid 0 \le m_{no} \le \min(a, b) - 1 \wedge n \le n_{\max}(a, b, m_{no}) \} \]

Proposition 3. For any caveat-free coloring $f$ with $\mathrm{olines}(f) = (a, b)$ and $|f| = n$, it holds that $\mathrm{no}_{\min}(n, a, b) \le \#\text{non-overlaps}(f) \le \mathrm{no}_{\max}(n, a, b)$.
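The admissible range for the number of non-overlapping rows can be computed by a direct scan. This is a minimal sketch, assuming the definitions n_max(a, b, m_no) = m_no + (a − m_no)(b − m_no) and n_min(a, b, m_no) = a + b − 1 − m_no; returning `None` for an empty candidate set is a choice made here, not something the paper specifies.

```python
def n_max(a, b, m_no):
    """Largest size of a caveat-free coloring with lines (a, b) and m_no non-overlaps."""
    return m_no + (a - m_no) * (b - m_no)


def n_min(a, b, m_no):
    """Smallest size of a caveat-free coloring with lines (a, b) and m_no non-overlaps."""
    return a + b - 1 - m_no


def no_min(n, a, b):
    """Smallest admissible number of non-overlapping rows, or None if there is none."""
    candidates = [m for m in range(min(a, b)) if n >= n_min(a, b, m)]
    return min(candidates) if candidates else None


def no_max(n, a, b):
    """Largest admissible number of non-overlapping rows, or None if there is none."""
    candidates = [m for m in range(min(a, b)) if n <= n_max(a, b, m)]
    return max(candidates) if candidates else None
```

For n = 2 and (a, b) = (2, 2), both bounds equal 1, matching the only such coloring up to symmetry: two diagonally placed points.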
5 Number of i-Points for Caveat-Free Colorings



In the next two sections, we will provide a bound on interlayer contacts. For this purpose, we calculate for a coloring $f$ of plane $c$ the numbers of points having 4, 3, 2, and 1 contacts to $f$ (in the following called $i$-points). Theorem 1 will state that we can achieve the maximal number of interlayer contacts between $x = c$ and $x = c + 1$ if we fill the 4-points first, then (if points are left) the 3-points, and so on. Before that, we need some definitions and auxiliary lemmata.

In the following, let $f$ be a plane coloring of plane $x = c$ and $f'$ a plane coloring of plane $x = c'$, where $c \neq c'$. We define the number of interlayer contacts of $f$ and $f'$ by $\mathrm{IC}_f^{f'} := \mathrm{con}(f \uplus f') - \mathrm{LC}_f - \mathrm{LC}_{f'}$. We define $\mathrm{contacts}_{\max}(f, n)$ as

\[ \max\left\{ \mathrm{IC}_f^{f'} \;\middle|\; f' \text{ is a plane coloring of } x = c + 1 \text{ with } |f'| = n \right\}. \]

A point $p$ is called a 4-point for $f$ if $p$ is in plane $x = c + 1$ or $x = c - 1$ and $p$ has 4 neighbors $p_1, \ldots, p_4 \in f$. Analogously, we define 3-points, 2-points and 1-points. Furthermore, we define $\#4_{c-1}(f) = |\{p \mid p \text{ is a 4-point for } f \text{ in } x = c - 1\}|$. Analogously, we define $\#4_{c+1}(f)$ and $\#i_{c \pm 1}(f)$ for $i = 1, 2, 3$. We will show that the number of $i$-points for every $i \in \{1, 2, 3, 4\}$ depends only on the number of non-overlaps, the number of non-connects, and the number of x-steps. An x-step for a plane coloring $f$ is a triple $(p_1, p_2, p_3)$ such that $f(p_1) = 0$, $f(p_2) = 1 = f(p_3)$, $p_1 - p_2 = \pm(0, 1, 0)$ and $p_1 - p_3 = \pm(0, 0, 1)$. With $\mathrm{xsteps}(f)$ we denote the number of x-steps of $f$. Now we can define the number of $i$-points, depending on $n = |f|$, $s = \mathrm{Surf}_{pl}(f)$, $m_x = \mathrm{xsteps}(f)$, $m_{no} = \#\text{non-overlaps}(f)$ and $m_{nc} = \#\text{non-connects}(f)$:



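\[ \#4^{n,s}_{m_{no},m_{nc},m_x} = n - \tfrac{1}{2} s + 1 + m_{no} \]
\[ \#3^{n,s}_{m_{no},m_{nc},m_x} = m_x - 2(m_{no} - m_{nc}) \]
\[ \#2^{n,s}_{m_{no},m_{nc},m_x} = s - 4 - 2\,\#3^{n,s}_{m_{no},m_{nc},m_x} - 3 m_{no} - m_{nc} \]
\[ \#1^{n,s}_{m_{no},m_{nc},m_x} = \#3^{n,s}_{m_{no},m_{nc},m_x} + 2 m_{no} + 2 m_{nc} + 4 \]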
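The closed forms for the i-points can be compared with a brute-force count on small colorings. The sketch below assumes the reading #4 = n − s/2 + 1 + m_no, #3 = m_x − 2(m_no − m_nc), #2 = s − 4 − 2·#3 − 3·m_no − m_nc and #1 = #3 + 2·m_no + 2·m_nc + 4. This reading agrees with the bound BNMIC given later in the paper and with the identity 4·#4 + 3·#3 + 2·#2 + #1 = 4n, but it is a reconstruction. A plane coloring is a set of integer (y, z) pairs, and a point of the next layer is identified with the unit cell whose four corners are its neighbors; all names are illustrative.

```python
from collections import Counter


def i_point_counts(points):
    """Brute force: number of next-layer points with 1, 2, 3, 4 neighbors in `points`."""
    hits = Counter()
    for y, z in points:
        # (y, z) is a corner of four unit cells; cell (cy, cz) stands for
        # the next-layer point (cy + 0.5, cz + 0.5).
        for cell in ((y, z), (y - 1, z), (y, z - 1), (y - 1, z - 1)):
            hits[cell] += 1
    histogram = Counter(hits.values())
    return {i: histogram.get(i, 0) for i in (1, 2, 3, 4)}


def xsteps(points):
    """Triples (p1, p2, p3): p1 uncolored, p2 a colored y-neighbor, p3 a colored z-neighbor."""
    border = {(y + dy, z + dz) for y, z in points
              for dy, dz in ((1, 0), (-1, 0), (0, 1), (0, -1))} - set(points)
    return sum(
        sum((y + d, z) in points for d in (1, -1))
        * sum((y, z + d) in points for d in (1, -1))
        for y, z in border
    )


def predicted_counts(n, s, m_no, m_nc, m_x):
    """Closed forms for the numbers of i-points (assumed reading, see lead-in)."""
    c4 = n - s // 2 + 1 + m_no
    c3 = m_x - 2 * (m_no - m_nc)
    c2 = s - 4 - 2 * c3 - 3 * m_no - m_nc
    c1 = c3 + 2 * m_no + 2 * m_nc + 4
    return {1: c1, 2: c2, 3: c3, 4: c4}
```

In every case the weighted sum of the counts equals 4n, since each colored point has exactly four neighbors in the next layer.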

In preparation, we state two lemmas that investigate how to calculate the $i$-points of $f$ from the two sub-colorings generated by splitting $f$ at a non-overlapping or unconnected row.

Lemma 5 (Split 3-Points). Let $f$ be a caveat-free coloring of plane $x = c$ with $\#\text{non-overlaps}(f) \ge 1$, and let $z_s$ be a non-overlapping row. Then, $\#3(f) = \#3(f_{\le z_s}) + \#3(f_{> z_s})$.

Proof (Sketch). We can show that neither $f_{\le z_s}$, nor $f_{> z_s}$, nor $f$ has a 3-point that lies between rows $z_s$ and $z_s + 1$. This implies that every 3-point for $f$ is either below $z_s$ and is therefore also a 3-point for $f_{\le z_s}$, or above $z_s + 1$ and is therefore also a 3-point for $f_{> z_s}$. □

Lemma 6 (Split at Minimal Unconnected Row). Let $f$ be a caveat-free coloring of plane $x = c$ with $\#\text{non-connects}(f) \ge 1$, and let $z_s$ be the minimal unconnected row. Then, $\#\text{non-connects}(f_{\le z_s}) = 0$, $\#\text{non-connects}(f_{> z_s}) = \#\text{non-connects}(f) - 1$ and

\[ \mathrm{xsteps}(f_{\le z_s}) + \mathrm{xsteps}(f_{> z_s}) = \mathrm{xsteps}(f), \quad (1) \]
\[ \forall i \in \{1, 2, 3, 4\} : \#i(f_{\le z_s}) + \#i(f_{> z_s}) = \#i(f). \quad (2) \]

Proof (Sketch). The first two claims are trivial. For claims (1) and (2), one shows that if $z_s$ is unconnected, the $y$-distance between points in $f_{\le z_s}$ and $f_{> z_s}$ is always greater than 1. This implies that the sets of $i$-points and x-steps of $f_{\le z_s}$ and $f_{> z_s}$ are disjoint. □



Lemma 7. Let $f$ be a caveat-free coloring. Then

\[ \forall i \in \{1, 2, 3, 4\} : \#i(f) = \#i^{|f|,\ \mathrm{Surf}_{pl}(f)}_{\#\text{non-overlaps}(f),\ \#\text{non-connects}(f),\ \mathrm{xsteps}(f)}. \]

Proof (Sketch). The case $\#\text{non-connects}(f) = 0$ is equivalent to the formula already proven in [?]. For the case $\#\text{non-connects}(f) = m_{nc} > 0$, we do induction on $m_{nc}$. The claims for $\#4(f)$, $\#3(f)$ and $\#1(f)$ follow from the Split-Lemmata (Lemmata 2, 5 and 6) by simple calculation (recall that by definition, every unconnected line is also non-overlapping). For $\#2(f)$, the claim follows by simple calculation from the equation $4\,\#4(f) + 3\,\#3(f) + 2\,\#2(f) + 1\,\#1(f) = 4|f|$. This equation holds since the sum of all interlayer contacts between $f$ and the next plane is $4|f|$. □



6 Maximal Number of 3-Points 

Due to the last lemma, if we consider colorings with given $n$, $a$, $b$, $m_{no}$, and $m_{nc}$, then $m_x$ does not affect the number of 4-points, but increases the number of 3-points and 1-points, while decreasing the number of 2-points. The increase of 3- and 1-points is 1 per x-step, the decrease of 2-points is 2 per 3-point. This pattern guarantees that we maximize the possible number of interlayer contacts to a second plane with a given number of elements if we maximize the number of 3-points in the first plane. For this purpose, we first show that we need not distinguish between unconnected and non-overlapping rows for the number of 3-points. The reason is that the number of 3-points does not change if one transforms a non-overlapping row into an unconnected row. Consider as an example two colorings $f$ and $f'$, where $f'$ arises from $f$ by transforming the non-overlapping row of $f$ into an unconnected row. Then both $f$ and $f'$ have one 3-point (indicated in grey). By this transformation, $f'$ loses two x-steps. Thus, the effect of increasing $\#\text{non-connects}(\cdot)$ by 1 is compensated by decreasing $\mathrm{xsteps}(\cdot)$ by 2.

Note that such a bound for the interlayer contacts, using a bound for 3-points that does not distinguish between non-overlapping and unconnected rows, slightly overestimates, since we assume the best case for the number of 2- and 1-points (note that in contrast to the number of 3-points, the number of 2- and 1-points depends on the exact number of unconnected rows).

We start with the extension of the bound for 3-points, as given in [?] in the case of “sufficiently filled” and overlapping colorings, to arbitrary overlapping colorings. We need to recall some definitions from [?]. For an overlapping coloring $f$ with $\mathrm{olines}(f) = (a, b)$, $a$ and $b$ are the side lengths of the minimal rectangle around the points in $f$ (called $\mathrm{frame}(f)$ in the following). The detailed frame of a coloring $f$ is the tuple $(a, b, i_{lb}, i_{lu}, i_{rb}, i_{ru})$, where $(a, b)$ is the frame of $f$ and $i_{lb}$ is the number of diagonals that can be drawn from the left-bottom corner; $i_{lu}$, $i_{rb}$ and $i_{ru}$ are defined analogously. For a coloring $f$ with detailed frame $(a, b, i_{lb}, i_{lu}, i_{rb}, i_{ru})$, we call $i = (i_{lb}, i_{lu}, i_{rb}, i_{ru})$ the indent vector of $f$. As shown in [?], the indent vector gives a precise bound on $\#3(f)$, since in this case, $\mathrm{xsteps}(f) = \#3(f)$ and $\mathrm{xsteps}(f) = i_{lb} + i_{lu} + i_{rb} + i_{ru} - \mathrm{diagcav}(f)$. Here, $\mathrm{diagcav}(f)$ counts the number of diagonal caveats, which are defined analogously to vertical and horizontal caveats. For example, consider the plane coloring $f_{ex}$ given in Figure 3. The detailed frame of $f_{ex}$ is $(6, 9, 3, 2, 1, 2)$. The number of 3-points (indicated by x) for $f_{ex}$ is $8 = 3 + 2 + 1 + 2$, since $f_{ex}$ does not contain diagonal caveats.

Fig. 3. Detailed Frame.

In the overlapping case, we search, for a given number of points $n$ and a frame $(a, b)$, for the maximal number of x-steps. For this purpose, we define for an indent vector $i = (i_1, i_2, i_3, i_4)$, $\mathrm{vol}(a, b, i) := ab - \sum_{j=1}^{4} \frac{i_j (i_j + 1)}{2}$. $\mathrm{vol}(a, b, i)$ is the maximal number of points that can be colored by any $f$ that has indent vector $i$ and frame $(a, b)$. $i = (i_1, i_2, i_3, i_4)$ is called maximal for $(a, b)$ iff $\sum_{1 \le j \le 4} i_j = 2(\min(a, b) - 1)$. For example, if $b < a$, then the indent vector $i$ is maximal for $(a, b)$ if every coloring with frame $(a, b)$ and indent vector $i$ has exactly one colored point in the first and last column.




$\mathrm{vol}(a, b, i)$ can now be used to calculate the maximal number of x-steps that can be achieved given $n$ colored points and frame $(a, b)$. The maximal number of x-steps is achieved if we make the indents as uniform as possible. For this purpose, define $\mathrm{edge}(n, a, b) = \max\{k \in \mathbb{N} \mid \mathrm{vol}(a, b, (k, k, k, k)) \ge n\}$. $k = \mathrm{edge}(n, a, b)$ defines the maximal possible uniform indent. Then

\[ r = \mathrm{ext}(n, a, b) = \left\lfloor \frac{ab - 4\frac{k(k+1)}{2} - n}{k + 1} \right\rfloor \]

defines the number of times $r$ we can extend the uniform indent by 1. $n$ is called normal for $(a, b)$ if either $4k + r < 2(a - 1)$, or $4k + r = 2(a - 1)$ and $ab - 4\frac{k(k+1)}{2} - r(k + 1) = n$.

Now there are two upper bounds that can be given for the number of x-steps, given $n$ colored points and frame $(a, b)$. The first is given by the indent vector. The second follows from the fact that in caveat-free and overlapping colorings, there may be at most 2 x-steps between every two successive lines, which gives at most $2(\min(a, b) - 1)$. Thus, the bound given in [?] is as follows:

\[ \mathrm{xsteps}_{bnd}(n, a, b) = \min(4\,\mathrm{edge}(n, a, b) + \mathrm{ext}(n, a, b),\ 2(\min(a, b) - 1)). \]

We improve the bound in the case that the frame $(a, a)$ is square and $n$ is not normal for $(a, a)$. Here, we show that we have an upper bound of $2a - 3$ instead of $2a - 2$ if there is no maximal indent $i$ with $n = \mathrm{vol}(a, a, i)$. We show that in this case, there must be a diagonal caveat.
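The x-step bound for overlapping colorings is simple to evaluate. The following sketch assumes vol(a, b, i) = ab − Σ_j i_j(i_j + 1)/2 (an indent of i_j diagonals at a corner removes 1 + 2 + … + i_j points) and 1 ≤ n ≤ ab; the function names mirror the paper's notation but the code is only an illustration.

```python
def vol(a, b, indent):
    """Maximal number of colored points for frame (a, b) and the given indent vector."""
    return a * b - sum(i * (i + 1) // 2 for i in indent)


def edge(n, a, b):
    """Largest uniform indent k such that vol(a, b, (k, k, k, k)) >= n."""
    k = 0
    while vol(a, b, (k + 1,) * 4) >= n:
        k += 1
    return k


def ext(n, a, b):
    """Number of corners whose indent can be extended from k to k + 1."""
    k = edge(n, a, b)
    return (vol(a, b, (k,) * 4) - n) // (k + 1)


def xsteps_bnd(n, a, b):
    """Upper bound on x-steps for n points in an overlapping coloring with frame (a, b)."""
    return min(4 * edge(n, a, b) + ext(n, a, b), 2 * (min(a, b) - 1))
```

With frame (6, 9) and indent vector (3, 2, 1, 2), `vol` gives 41 points, and `xsteps_bnd(41, 6, 9)` evaluates to 8, the same value as the indent sum 3 + 2 + 1 + 2 of the example coloring in the text.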

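Lemma 8. For every overlapping caveat-free coloring $f$ we get

\[ \#3(f) \le \#3_{bound}(|f|, a, b), \]

where $(a, b) = \mathrm{frame}(f)$ and

\[ \#3_{bound}(n, a, b) = \begin{cases} \mathrm{xsteps}_{bnd}(n, a, b) & n \text{ is normal for frame } (a, b) \\ 2\min(a, b) - 2 & \text{else if } a \neq b \\ 2a - 2 & \text{else if } \exists i : i \text{ is a maximal indent for } (a, a) \wedge n = \mathrm{vol}(a, a, i) \\ 2a - 3 & \text{otherwise} \end{cases} \]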



For the general case of possibly non-overlapping colorings, Lemmata 1, 3 and 4 imply that any coloring $f$ with $|f| = n$, $\mathrm{olines}(f) = (a, b)$ and $\#\text{non-overlaps}(f) = m_{no}$ satisfies $\mathrm{valid}(n, a, b, m_{no}) := (m_{no} < \min(a, b) \wedge n_{\min}(a, b, m_{no}) \le n \le n_{\max}(a, b, m_{no}))$. Hence, we define $\#3_{bound}(n, a, b, m_{no})$ to be $-\infty$ in the case that $\mathrm{valid}(n, a, b, m_{no})$ does not hold. Otherwise, we define $\#3_{bound}(n, a, b, m_{no})$ by $\#3_{bound}(n, a, b)$ if $m_{no} = 0$ and

\[ \max\left\{ \#3_{bound}(n', a', b', 0) + \#3_{bound}(n - n', a - a', b - b', m_{no} - 1) \;\middle|\; 1 \le n' \le n - 1,\ 1 \le a' \le a - 1,\ 1 \le b' \le b - 1 \right\} \]

otherwise.

Lemma 9. For every caveat-free coloring $f$, it holds that $\#3(f) \le \#3_{bound}(n, a, b, m_{no})$, where $n = |f|$, $(a, b) = \mathrm{olines}(f)$, and $m_{no} = \#\text{non-overlaps}(f)$.

Proof (Sketch). The case $m_{no} = 0$ is treated in Lemma 8. For $n$, $a$, $b$ and $m_{no} > 0$ with $\mathrm{valid}(n, a, b, m_{no})$, we can split a coloring $f$ at the minimal non-overlapping line $z_s$ into $f_{\le z_s}$ and $f_{> z_s}$ and get $\#3(f) = \#3(f_{\le z_s}) + \#3(f_{> z_s})$ by Lemma 5. Considering all possible rows for splitting will give the second case of $\#3_{bound}(n, a, b, m_{no})$. □

The bound on the number of 3-points can now be used to derive a bound on the number of interlayer contacts for arbitrary colorings. Summarizing, we get the following bound:

\[ \mathrm{BNMIC}^{n_2}_{n_1,a_1,b_1}(m_{no1}) := 4\min(n_2, \#4) + 3\min(\#3, \max(n_2 - \#4, 0)) + 2\min(\#2, \max(n_2 - \#4 - \#3, 0)) + \min(\#1, \max(n_2 - \#4 - \#3 - \#2, 0)) \]

where $\#4 = n_1 - a_1 - b_1 + 1 + m_{no1}$, $\#3 = \#3_{bound}(n_1, a_1, b_1, m_{no1})$, $\#2 = 2(a_1 + b_1) - 4 - 2\,\#3 - 3 m_{no1}$ and $\#1 = \#3 + 2 m_{no1} + 4$.
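The bound fills the most valuable positions of the second layer first: 4-points, then 3-points, and so on. A minimal sketch of that greedy evaluation follows. The value of the 3-point bound is passed in as the hypothetical parameter `bound3` instead of being computed, and all counts are assumed to be non-negative.

```python
def bnmic(n2, n1, a1, b1, m_no1, bound3):
    """Greedy upper bound on interlayer contacts between a layer with
    parameters (n1, a1, b1, m_no1) and a neighboring layer with n2 points.

    `bound3` is the caller-supplied bound on the number of 3-points.
    """
    count4 = n1 - a1 - b1 + 1 + m_no1
    count3 = bound3
    count2 = 2 * (a1 + b1) - 4 - 2 * count3 - 3 * m_no1
    count1 = count3 + 2 * m_no1 + 4

    total = 0
    remaining = n2  # points of the second layer not yet placed
    for weight, available in ((4, count4), (3, count3), (2, count2), (1, count1)):
        used = min(available, max(remaining, 0))
        total += weight * used
        remaining -= available
    return total
```

For a 2 × 2 square in the first layer (n1 = 4, a1 = b1 = 2, no non-overlapping rows, no 3-points), a single point in the second layer yields at most 4 contacts, and the bound saturates at 4·n1 = 16 once the second layer is large enough.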

Theorem 1. Let $f_1$ and $f_2$ be colorings of planes $x = c$ and $x = c + 1$, respectively. Let $n_1 = |f_1|$, $\mathrm{olines}(f_1) = (a_1, b_1)$, $|f_2| = n_2$ and $\mathrm{olines}(f_2) = (a_2, b_2)$. Then

7 Constructing the Compact Cores 

We will now show how to compute the optimally compact cores for a given 
number of elements, thereby employing the given bound on interlayer contacts, 
for a branch-and-bound approach. Due to space restrictions, we have to omit 
many details of the approach. 

W.l.o.g., let a coloring $f$ be decomposed into plane colorings $f_1 \uplus \cdots \uplus f_k$. A dynamic programming algorithm allows one to efficiently compute bounds $\mathrm{BMC}(n, n_1, a_1, b_1)$ such that for every coloring $f = f_1 \uplus \cdots \uplus f_k$, it holds that $\mathrm{BMC}(n, n_1, a_1, b_1) \ge \mathrm{con}(f)$, where $|f| = n$, $|f_1| = n_1$, and $\mathrm{olines}(f_1) = (a_1, b_1)$. From this algorithm we immediately get a maximal number of contacts in any coloring with $n$ elements. Further, let a layer sequence be a sequence of triples $(n_i, a_i, b_i)$. A coloring $f$ is called $s$-compatible if every plane restriction $f_i$ of $f$ is compatible to $s_i = (n_i, a_i, b_i)$, i.e. $|f_i| = n_i$ and $\mathrm{olines}(f_i) = (a_i, b_i)$.

By traceback from the above dynamic programming algorithm, one efficiently obtains the set of all layer sequences $s$ for which there may exist (by our bound) an $s$-compatible coloring $f$ with $b$ contacts. That is, we define this set of sequences by $S(n, b) := \{ s \text{ layer sequence} \mid \text{bound for } s \text{ greater or equal } b \}$.

To find optimally compact colorings, it remains to search by constraint-based search through the colorings of candidate layer sequences.

Now, we assume that the sets $S(n, b)$ are already precomputed by the dynamic programming algorithm. To find one optimally compact coloring with $n$ elements, do the following. Let $b_n$ be the contacts bound for colorings with $n$ elements. For ascending $i \ge 0$, iteratively search for a coloring $f$ with $b_n - i$ contacts in all layer sequences $s \in S(n, b_n - i)$. Clearly, the first coloring found by this procedure has maximal contacts. To find all colorings with a given number $b$ of contacts (e.g. all best colorings), we perform an analogous search in all layer sequences $s \in S(n, b)$.
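The outer loop of this procedure can be sketched as follows. Both callables are hypothetical stand-ins: `candidate_sequences(n, b)` plays the role of the precomputed set S(n, b), and `search_coloring(sequence, b)` stands for the constraint-based search, returning a coloring with b contacts or `None`. Neither is taken from the authors' implementation.

```python
def find_optimal_core(n, contact_bound, candidate_sequences, search_coloring):
    """Lower the target contact number from the bound until a coloring is found.

    Returns (contacts, coloring) for the first hit, which has maximal contacts
    provided `contact_bound` is a valid upper bound, or None if nothing is found.
    """
    for target in range(contact_bound, -1, -1):
        for sequence in candidate_sequences(n, target):
            coloring = search_coloring(sequence, target)
            if coloring is not None:
                return target, coloring
    return None
```

Because targets are tried in descending order, no coloring with more contacts than the returned one can exist among the candidate sequences that were searched.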
Table 1. Search for one optimally compact core with n elements, given a layer sequence. We give the number of contacts, as well as nodes and time of the constraint search.

  n    # contacts   # search-nodes   time in s
 23        76             15            0.1
 60       243            150            0.7
 89       382            255            2.1
100       436             82            1.2



Table 2. Sequences L1-L5 (taken from [?]) with absolute walks of optimal conformations in the FCC-HP-model. The steps of the walk are given by points of the compass. The + and − indices indicate an additional 45° walk out of the plane.



L1 HPPPPHHHHPPHPHPHHHPHPPHHPPH :

EN+ S+ SN~ S+ S~ S~ S+ N~ N~ N+ S~ N+ S+ESN~ SWN+ N+ S+ EN~ 

L2 HPPPHHHHPHPHHPPPHPHHPHPPPHP : 

S~ S~ NS+N+ N~ N~ NS+ N+ SWN+ N~ N~ SS+ S~ S~ WS+ S+N+ ENN+ 

L3 HPHHPPHHPPHHHHPPPHPPPHHHPPH : 

S+EEN- N~ StNEN+ WS~ S+ tVS+ S+BAfy S+ N~ NS~ N+WWSt S~ 

L4 HHPHHPHHPHHHHHHPPHHHHHPPHHHHHHH : 

^7 ^7 ^7 EEN~ SSS+ S+N~ N~ N~ NS+ S+ S+NS+N+ N~ S- S- WS+ S+ S- N~ 

L5 PHPPHPPHPPHPPHPPHPPHPPHPPHPPHPPHPPHP :

St s~ S+ESN- ES~ EN+ N~ WN+ N~ ES+ N~ S+ S+ ENWN+ WS~ N+ S+ S~ S+ EN~ SS~ NS+ 



8 Results 

We have computed all sets of layer sequences $S(n, b)$ for $n \le 100$ in about 10 days on a standard PC. For a given layer sequence, one optimally compact core is usually found within a few seconds by our constraint-based search program. Some results are shown in Table 1.

We present some of the optimal cores for $n = 60$ and $n = 100$ elements in Figures [?] and [?]. The cores are shown in plane sequence representation. This representation shows a coloring by the sequence of its occupied $x$-layers in the lattice $D_3'$. For each $x$-layer $x = x_0$, the lower left corner of the grid has coordinates $(x_0, 0, 0)$. The grid-lines have distance 1. The core points in each $x$-layer are shown as filled circles. There is a noteworthy difference between layers $x = x_0$ where $x_0$ is even and those where it is odd. In the latter ones, the points have non-integer $y$ and $z$ coordinates.

Further, we folded some proteins of the FCC-HP-model, using a program from [?], to their now proven optimum. The results are shown in Table 2.